Zippy George Bungle

Cluster Analysis

Aims and Objectives

  • Have a working knowledge of the ways in which similarity between cases can be quantified (e.g. single linkage, complete linkage and average linkage).
  • Be able to produce and interpret dendrograms produced by SPSS.
  • Know that different methods of clustering will produce different cluster structures.

Continue reading “Cluster Analysis”



Many of you will have seen former APS president Professor Susan Fiske’s recently leaked opinion piece in the APS observer and the outcry it has caused.

I’m late in on this, but in my defence I have a 6 week old child to help keep alive and I’m on shared parental leave, so I’m writing this instead of, you know, enjoying one of the rare moments I get to myself. I really enjoyed (and agreed with) Andrew Gelman’s blog post about it, and there are other good pieces by, amongst others, Chris Chambers. I won’t pretend to have read other ones, so forgive me if I’m repeating someone else. It wouldn’t be the first time.
Continue reading “StinkFiske”

“If you’re not doing something different, you’re not doing anything at all.”

Yesterday was the official launch of my new textbook An Adventure in Statistics: The Reality Enigma. Although a few ‘print to order’ copies are floating about, the ‘proper’ hi-res print copies won’t be available for a few more weeks, but I thought it was a good opportunity to blog something about the book and perhaps textbook writing more generally. I’m going to start by telling you something about the book. Then I will try to give you an idea of the timeline and some rough statistics that probably don’t do justice to the emotional and physical investment that goes into a textbook.

Cover of An Adventure in Statistics
An Adventure in Statistics

A history of ‘an adventure in statistics’

Visualization guru (and sculptor) Edward Tufte apparently has a small sign taped to his computer screen that says “If you’re not doing something different, you’re not doing anything at all.” It’s a note that I don’t have taped to my monitor, but I probably should because I like ‘different’, and I strive for ‘different’  not always in a good way.

In 2008 I was in Rotterdam updating my SPSS book (third edition) and like all of my books I had a long list of things from the previous edition that I hated and wanted to change.  It would be easy to just change the SPSS screenshots and slap a new cover on the front, but I wanted to do something different. After all “If you’re not doing something different, you’re not doing anything at all.”

I thought it would be interesting to try to embed the academic content of the book within a fictional story. I didn’t have a story though, and I had only 6 months to update the book. It would be impossible. So I copped out: I book-ended each chapter with an anecdote from the only story I had to hand – my life. Some thought it was different, but to me it was a poor imitation of what I could have done.

A couple of years later I was approached to write a stats book for the ‘for dummies’ range. I was tempted. I spoke to my editor at SAGE (who publish my statistics books) because of potential overlap with the SPSS book. This led to a conversation with Ziyad Marar who runs the London office of SAGE. I’ve known Ziyad a long time – he signed my first book – but trust me, he rarely phones me. That’s true of most people because I go to great lengths to tell everyone how uncomfortable telephones make me, but a call from Ziyad is a particularly rare and beautiful thing. The gist of that conversation was that Ziyad convinced me to write new book for SAGE instead. He said, something to the effect of:

‘Why not write that book for us? We will let you do whatever you like  express yourself fully.’
“What?” I asked, “You’d give me complete control even after ‘the incident’?”
“Yes”. He replied (after what I like to mis-remember as a dramatic pause).

Ziyad was offering me the opportunity to set my imagination free, to go somewhere that perhaps other publishers would not let me go, to try something without needing to justify it with research, or pedagogy. An opportunity to follow my heart and not my head, but what did my heart want to do? I briefly considered whether it was possible to put even more penis jokes into a statistics textbook, but I’d been there, done that, worn the phallus and “If you’re not doing something different, you’re not doing anything at all.”

I thought back to 2008, to the idea of writing a fictional story through which a student learns statistics through a shared adventure with the main character. I thought about collaborating with a graphic novel illustrator to bring the story to life. I didn’t know anything about writing fiction: but, I didn’t know anything about logistic regression and multilevel models before I wrote 60-page chapters about them. Not knowing something should never be an obstacle to writing about it.

I got on board a badass illustrator, James Iles, to create graphic novel strips to bring the story to life. There have been a few pivotal moments during the book’s life but none more than the moment that James replied to my advert on He fought off about 80 other applicants to get the gig, and although I deluded myself that the choice of illustrator was a complex, make-or-break, decision, my gut instinct always pointed to James. He’d done storyboarding for Doctor Who, and I fucking love Doctor Who. If James was good enough Doctor Who, he was certainly going to be good enough for me. Unknown to me at the time, I hadn’t just found an exceptionally talented artist, but I’d also found someone who would put as much passion and care into the book as I would.

What is ‘an adventure in statistics’ all about?

An adventure in statistics is set in a future in which the invention of the reality prism, a kind of hat that splits reality into the subjective and objective has bought society to collapse by showing everyone the truth. Without blind belief, no-one tried anymore. In the wake of this ‘reality revolution’ society fragmented into people who held onto the pre-technological past (the Clocktorians) and those who embraced the ever-accelerating technology of the new world (the chippers). Society had become a mix of the ultra-modern and the old fashioned.

Into this world I put Zach, a rock musician, and his girlfriend Dr. Alice Nightingale. They are part of the first generation since the revolution to believe that they can change the world. Zach through his music, and Alice through her research. Then Alice suddenly disappears leaving Zach with a broken heart, a song playing on repeat and a scientific report that makes no sense to him. Fearing the worst, he sets out to find her. Strange things happen: people collapse and lose their memories, he gets messages from someone called Milton, and the word JIG:SAW haunts him. Zach feels that something is terribly wrong and that Alice is in danger, but her vanishing triggers an even worse thought: that after 10 years they have drifted apart.

At a simple level ‘an adventure in statistics’ is a story about Zach searching for Alice, and seeking the truth, but it’s also about the unlikely friendship he develops with a sarcastic cat, it’s about him facing his fear of science and numbers, it’s about him learning to believe in himself. It’s a story about love, about not forgetting who you are. It’s about searching for the heartbeats that hide in the gaps between you and the people you love. It’s about having faith in others.

Of course, it’s also about fitting models, robust methods, classical and Bayesian estimation, significance testing and whole bunch of other tedious statistical things, but hopefully you’ll be so engrossed in the story that you won’t notice them. Or they might be a welcome relief from the terrible fiction. Time will tell.

What goes into creating a textbook?

What does writing a textbook involve? That is hard to explain. For an adventure in statistics I really enjoyed the writing (especially the fictional story), on the whole it has been the most rewarding experience of my academic career. However, rest assured that if you decide to write a textbook, you will hit some motivational dark times. Very dark times.

The timeline

I had the initial idea in 2008, I wrote the proposal in January 2011 (final version March 2011). The final contract with SAGE was signed in April 2011. Around this time, I started discussing with SAGE my idea to have graphic novel elements and a story. I started making notes about a potential story and characters in a black book and using Scrivener. I started writing in January 2014. By this point James Iles had just come on board.  (SAGE are doing some videos where James and I discuss how we worked, so I won’t say more on that.) At the point that I started writing I had a lot of ideas, most of the characters in place and a decent idea of what would happen at the beginning and end of the story, and some bits in the middle.  A lot of the story developed as I wrote. (One thing I learned in writing this book is that even though I thought I’d done a lot of planning, I should have done an awful lot more before writing the first word!) June 2014 my wife and I had our first child. I took unpaid paternity leave and did quite a long stretch of writing (4 months) where I’d spend the day doing dad stuff until about 3-4pm and then start work, writing until 1-3am. I generally work better at night. The first draft was finished around April 2015. We had feedback from a fiction editor (Gillian Stern) on the story which came to me May 2015. I did a re-draft of the entire book based on that, which I finished around August 2015. I then had a bunch more feedback on the story from Robin, my development editor at SAGE, and on the statistics stuff and story from my wife. I did a third and final draft which was submitted October 2015. January 2016 I received the copy editor’s comments for the entire book for approval (or not). March 2016 I received proofs of the entire book, which I spent 2-3 weeks reading/correcting working well into the night most nights. April 2016 I received the corrected proofs to approve. In a sense then, it’s consumed 8 years of my life (as an ambition), but really it’s more like 4 years of work, 2 of them intensive.

The anatomy of ‘an adventure in statistics’

  • I don’t know exactly how many hours I spent on the book, but I spent probably 2 years casually collecting ideas and thoughts, and developing ideas for the structure and so on. I spent another 21 months pretty much doing not a lot else but writing or re-drafting the book. I had my university job to do as well, so it’s impossible to really know how many hours it took to create, but it’s probably somewhere in the region of 4000 hours. That’s just to the point of submitting the manuscript.
  • I wrote 297,676 words, ~1.6 million characters, 13,421 paragraphs and 28,768 lines. In terms of word length that’s about 3-4 psychology PhD theses, or if you assume the average research paper is about 5000 words then it’s about 60 research papers. In 2 years. I will get precisely no credit in the REF for this activity. [I’m not saying I should, I’m just making the point that you really are putting your research career on hold and investing a lot of creativity/energy into something that isn’t valued by the system that universities value. I am fortunate to be able to do this but I think this is a really tough balancing act for early-career scientists who want to write books.]
  • Given the book had three drafts, and I had to read proofs, I have read at least 1.19 million of my own words. It’ll be a lot more than that because of stuff you write and then delete.
  • I used Scrivener to plan the story. My project document in which I gathered ideas (e.g., plot ideas, character outlines, descriptions of locations, venues, objects, concepts, artwork ideas etc.) contains another 87,204 words and quite a few images – in addition to the 297,676 word in the book itself.
  • I created 603 diagrams. [Not all of them are the book because this includes old versions of diagrams, and image files that I used in diagrams – for example, an image of a normal curve that I drop into a finished diagram]. I used Omnigraffle incidentally for my diagrams, and graphs and stat-y stuff would have been created using R, most often with ggplot2.
  • I created 185 data-related files (data files, R-scripts etc.)
  • I wrote ~4000 lines of R-code (to generate data, create graphs, run analyses etc.).
  • At some point I will have to produce a bunch of online stuff – powerpoint presentations, handouts, answers to tasks in the book etc.
  • Basically, it was a lot of fucking work.


The beginning and the end

James Iles (Right) and I (Left) at the book launch

Yesterday the book was launched: it is both a beginning and an end. Beginnings can be exciting. It is the beginning of the public life of ‘an adventure in statistics’. It might be the beginning of it being a well-received book? The beginning of it inspiring young scientists? The beginning of people thinking differently about teaching statistics? That’d be nice but my excitement is laced with fear because beginnings can be scary too: today could be the beginning of seeing the book through a reality prism that shows me the objective truth in the form of scathing reviews, poor sales, sympathetic looks, and five wasted years.

Yesterday was also an end. Primarily an end to my work on the book (well, apart from a bunch of online materials …). I have never liked endings. When I was a child and people would come to stay, I always felt hollow when they left. For over 2 years, the characters in this book – especially Zach, Alice, Milton and Celia – have been the houseguests of my mind. We’ve had a lot of fun times. We’ve worked hard and played hard. We’ve had lots of late night conversations, we’ve shared our deepest feelings, we’ve discussed life, and they’ve helped me to see the world through different eyes. Yesterday they left me to find their own way in the world and I’m going to miss them. I feel a little hollow. I never thought I’d miss writing a statistics textbook.

It’s a scary time. I am proud of and excited about the book, and of what James and I have created. I’m also a little terrified that no-one else will share my enthusiasm  after all, it’s different to other statistics textbooks. People don’t always like ‘different’. Tufte’s words are a comfort though because if it’s true that “If you’re not doing something different, you’re not doing anything at all.” then I feel that, with ‘an adventure in statistics’ I have at least done something.

[Some of this blog is adapted from a speech I gave at the launch, which you can watch here]


Download the preface and chapter 1.
Read the article in the Times Higher Education Supplement about the book.
The book can be ordered direct from SAGE, or from your local Amazon or other retailer.

Max or No Max?

There’s been a recent spat between the heavy metal bands Sepultura and Soulfly. For those unaware of the history, 50% of Sepulture used to be the Cavalera brothers (Max and Igor) until Max (the frontman and guitarist) left the band in 1996 and formed Soulfly. The full story is here. There’s a lot of bad blood even 20 years later, and according to a recent story on metal sucks, Soulfly’s manager (and Max’s wife) Gloria Cavalier recently posted a fairly pointed post on her Facebook page. This got picked up by my favourite podcast (the metal sucks podcast). What has this got to do with me, or statistics? Well, one of the presenters of the metal sucks podcasts asked me this over Twitter:

After a very brief comment about needing to operationalise ‘better’, I decided that rather than reading book proofs I’d do a tongue in cheek analysis of what is better max or no max and here it is.

First we need to operationalise ‘better’. I have done this by accepting subjective opinion as determining ‘better’ and specifically ratings of albums on (although I am English metal sucks is US based, so I thought I’d pander to them and take ratings from the US site). Our questions then becomes ‘is max or no max rated higher by the sorts of people who leave reviews on Amazon’. We have operationalised our questions and turned it into a scientific statement, which we can test with data. [There are all sorts of problems with using these ratings, not least of which is that they tend to be positively biased, and they likely reflect a certain type of person who reviews, often reviews reflect things other than the music (e.g., arrived quickly 5*), and so on … but fuck it, this is not serious science, just a bit of a laugh.]

Post Sepultura: Max or No Max

The first question is whether post-max Sepultura or Soulfly are rated higher. Figure 1 shows that the data are hideously skewed with people tending to give positive reviews and 4-5* ratings. Figure 2 shows the mean ratings by year post Max’s departure (note they released albums in different years so the dots are out of synch, but it’s a useful timeline). Figure 2 seems to suggest that after the first couple of albums, both bands are rated fairly similarly: the Soulfly line is higher but error bars overlap a lot for all but the first albums.

Figure 1: Histograms of all ratings for Soulfly and (Non-Max Era) Sepultura

Figure 2: Mean ratings of Soulfly and (Non-Max Era) Sepultura by year of album release

There are a lot of ways you could look at these data. The first thing is the skew. That messes up estimates of confidence intervals and significance tests … but our sample is likely big enough that we can rely on the central limit theorem to do its magic and let us assume that the sampling distribution is normal (beautifully explained in my new book!)

I’m going to fit three models. The first is an intercept only model (a baseline with no predictors), the second allows intercepts to vary across albums (which allows ratings to vary by album, which seems like a sensible thing to do because albums will vary in quality) the third predicts ratings from the band (Sepultura vs Soulfly).

  maxModel2a<-gls(Rating ~ 1, data = sepvssoul, method = "ML")
  maxModel2b<-lme(Rating ~ 1, random = ~1|Album, data = sepvssoul, method = "ML")
  maxModel2c<-update(maxModel2b, .~. + Band)
  anova(maxModel2a, maxModel2b, maxModel2c)

By comparing models we can see:

           Model df      AIC      BIC    logLik   Test  L.Ratio p-value
maxModel2a     1  2 2889.412 2899.013 -1442.706                        
maxModel2b     2  3 2853.747 2868.148 -1423.873 1 vs 2 37.66536  <.0001
maxModel2c     3  4 2854.309 2873.510 -1423.155 2 vs 3  1.43806  0.2305

That album ratings varied very significantly (not surprising), the p-value is < .0001, but that band did not significantly predict ratings overall (p = .231). If you like you can look at the summary of the model by executing:

Which gives us this output:

Linear mixed-effects model fit by maximum likelihood
 Data: sepvssoul 
       AIC     BIC    logLik
  2854.309 2873.51 -1423.154

Random effects:
 Formula: ~1 | Album
        (Intercept) Residual
StdDev:   0.2705842 1.166457

Fixed effects: Rating ~ Band 
               Value Std.Error  DF   t-value p-value
(Intercept) 4.078740 0.1311196 882 31.107015  0.0000
BandSoulfly 0.204237 0.1650047  14  1.237765  0.2362
BandSoulfly -0.795

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-2.9717684 -0.3367275  0.4998698  0.6230082  1.2686186 

Number of Observations: 898
Number of Groups: 16 

The difference in ratings between Sepultura and Soulfly was b = 0.20. Ratings for soulfully were higher, but not significantly so (if we allow ratings to vary over albums, if you take that random effect out you’ll get a very different picture because that variability will go into the fixed effect of ‘band’.).

Max or No Max

Just because this isn’t fun enough, we could also just look at whether either Sepultura (post 1996) or Soulfly can compete with the Max-era-Sepultura heyday.
I’m going to fit three models but this time including the early Sepultura albums (with max). The models are the same as before except that the fixed effect of band now has three levels: Sepultura Max, Sepultura no-Max and Soulfly:

  maxModela<-gls(Rating ~ 1, data = maxvsnomax, method = "ML")
  maxModelb<-lme(Rating ~ 1, random = ~1|Album, data = maxvsnomax, method = "ML")
  maxModelc<-update(maxModelb, .~. + Band)
  anova(maxModela, maxModelb, maxModelc)

By comparing models we can see:

          Model df      AIC      BIC    logLik   Test   L.Ratio p-value
maxModela     1  2 4686.930 4697.601 -2341.465                         
maxModelb     2  3 4583.966 4599.973 -2288.983 1 vs 2 104.96454  <.0001
maxModelc     3  5 4581.436 4608.114 -2285.718 2 vs 3   6.52947  0.0382

That album ratings varied very significantly (not surprising), the p-value is < .0001, and the band did significantly predict ratings overall (p = .038). If you like you can look at the summary of the model by executing:

Which gives us this output:

Linear mixed-effects model fit by maximum likelihood
 Data: maxvsnomax 
       AIC      BIC    logLik
  4581.436 4608.114 -2285.718

Random effects:
 Formula: ~1 | Album
        (Intercept) Residual
StdDev:     0.25458 1.062036

Fixed effects: Rating ~ Band 
                         Value Std.Error   DF  t-value p-value
(Intercept)           4.545918 0.1136968 1512 39.98281  0.0000
BandSepultura No Max -0.465626 0.1669412   19 -2.78916  0.0117
BandSoulfly          -0.262609 0.1471749   19 -1.78433  0.0903
                     (Intr) BndSNM
BandSepultura No Max -0.681       
BandSoulfly          -0.773  0.526

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-3.3954974 -0.3147123  0.3708523  0.6268751  1.3987442 

Number of Observations: 1534
Number of Groups: 22  

The difference in ratings between Sepultura without Max compared to with him was b = -0.47 and significant at p = .012 (ratings for post-max Sepultura are significantly worse than for the Max-era Sepultura). The difference in ratings between Soulfly compared two Max-era Sepultura was b = -0.26 and not significant (p = .09) (ratings for Soulfly are not significantly worse than for the Max-era Sepultura). A couple of points here, p-values are silly, so don’t read too much into them, but the parameter (the bs) which quantifies the effect is a bit smaller for Soulfly.

Confidence Intervals

Interestingly if you write yourself a little bootstrap routine to get some robust confidence intervals around the parameters:

    boot.lme <- function(data, indices){
    data <- data[indices,] # select obs. in bootstrap sample
    model <- lme(Rating ~ Band, random = ~1|Album, data = data, method = "ML")
    fixef(model) # return coefficient vector

maxModel.boot<-boot(maxvsnomax, boot.lme, 1000)
  maxModel.boot, index = 1, type = “perc”), index = 2, type = “perc”), index = 3, type = “perc”)

Then you find these confidence intervals for the three betas (intercept, Post-Max Sepultura vs. Max Era-Sepultura, Soulfly vs. Max-Era-Sepultura):

Based on 1000 bootstrap replicates

CALL : = maxModel.boot, type = “perc”, index = 1)

Intervals : 
Level     Percentile     
95%   ( 4.468,  4.620 )  
Calculations and Intervals on Original Scale

Based on 1000 bootstrap replicates

CALL : = maxModel.boot, type = “perc”, index = 2)

Intervals : 
Level     Percentile     
95%   (-0.6153, -0.3100 )  
Calculations and Intervals on Original Scale

Based on 1000 bootstrap replicates

CALL : = maxModel.boot, type = “perc”, index = 3)

Intervals : 
Level     Percentile     
95%   (-0.3861, -0.1503 )  
Calculations and Intervals on Original Scalens: 1534
Number of Groups: 22  

The difference in ratings between Sepultura without Max compared to with him was b = -0.47 [-0.62, -0.31]. The difference in ratings between Soulfly compared to Max-era Sepultura was b = -0.26 [-0.39, -0.15]. This suggests  that both soulfully and post-Max Sepultura yield negative parameters that reflect (to the degree that you believe that a confidence interval tells you about the population parameter ….) a negative effect in the population. In other words, both bands are rated worse than Max-era Sepultura.


Look, this is just a bit of fun and an excuse to show you how to use a bootstrap on a multilevel model, and how you can use data to try to answer pointless questions thrown at you on Twitter. Based on this hastily thrown together analysis that makes a lot of assumptions about a lot of things, my 120 character twitter response will be: Sepultura Max better than everything, but post 1996 Max is no better than No Max;-)

R Code


You can generate the data using this R code (data correct as of today):
morbid<-c(rep(1,2), rep(2, 4), rep(3, 3), rep(4, 8), rep(5, 36))
  Schizo<-c(rep(1,1), rep(2, 2), rep(3, 4), rep(4, 10), rep(5, 33))
  remains<-c(2, rep(3, 5), rep(4, 9), rep(5, 104))
  Arise<-c(rep(2, 2), rep(4, 16), rep(5, 89))
  Chaos<-c(rep(1,4), rep(2, 2), rep(3, 9), rep(4, 20), rep(5, 120))
  Roots<-c(rep(1,9), rep(2, 8), rep(3, 17), rep(4, 24), rep(5, 94))
  Against<-c(rep(1,16), rep(2, 14), rep(3, 11), rep(4, 20), rep(5, 32))
  Nation<-c(rep(1,3), rep(2, 7), rep(3, 6), rep(4, 22), rep(5, 19))
  Roorback<-c(rep(1,6), rep(2, 6), rep(3, 5), rep(4, 13), rep(5, 20))
  Dante<-c(rep(1,1), rep(2, 3), rep(3, 4), rep(4, 8), rep(5, 30))
  Alex<-c(rep(1,1), rep(2, 1), rep(3, 3), rep(4, 6), rep(5, 18))
  Kairos<-c(rep(1,3), rep(2, 2), rep(3, 2), rep(4, 6), rep(5, 33))
  Mediator<- c(rep(1,0), rep(2, 3), rep(3, 4), rep(4, 6), rep(5, 21))
  morbid<-data.frame(rep("Morbid", length(morbid)), rep(1986, length(morbid)), morbid)
  Schizo<-data.frame(rep("Schizo", length(Schizo)), rep(1987, length(Schizo)), Schizo)
  Remains<-data.frame(rep("remains", length(remains)), rep(1989, length(remains)), remains)
  Arise<-data.frame(rep("Arise", length(Arise)), rep(1991, length(Arise)), Arise)
  Chaos<-data.frame(rep("Chaos", length(Chaos)), rep(1993, length(Chaos)), Chaos)
  Roots<-data.frame(rep("Roots", length(Roots)), rep(1996, length(Roots)), Roots)
  Against<-data.frame(rep("Against", length(Against)), rep(1998, length(Against)), Against)
  Nation<-data.frame(rep("Nation", length(Nation)), rep(2001, length(Nation)), Nation)
  Roorback<-data.frame(rep("Roorback", length(Roorback)), rep(2003, length(Roorback)), Roorback)
  Dante<-data.frame(rep("Dante", length(Dante)), rep(2006, length(Dante)), Dante)
  Alex<-data.frame(rep("Alex", length(Alex)), rep(2009, length(Alex)), Alex)
  Kairos<-data.frame(rep("Kairos", length(Kairos)), rep(2011, length(Kairos)), Kairos)
  Mediator<-data.frame(rep("Mediator", length(Mediator)), rep(2013, length(Mediator)), Mediator)
  names(morbid)<-c("Album", "Year", "Rating")
  names(Schizo)<-c("Album", "Year", "Rating")
  names(Remains)<-c("Album", "Year", "Rating")
  names(Arise)<-c("Album", "Year", "Rating")
  names(Chaos)<-c("Album", "Year", "Rating")
  names(Roots)<-c("Album", "Year", "Rating")
  names(Against)<-c("Album", "Year", "Rating")
  names(Nation)<-c("Album", "Year", "Rating")
  names(Dante)<-c("Album", "Year", "Rating")
  names(Alex)<-c("Album", "Year", "Rating")
  names(Kairos)<-c("Album", "Year", "Rating")
  names(Mediator)<-c("Album", "Year", "Rating")

  SepMax<-rbind(morbid, Schizo, Remains, Arise, Chaos, Roots)
  SepMax$Band<-"Sepultura Max"
  SepNoMax<-rbind(Against, Nation, Dante, Alex, Kairos, Mediator)
  SepNoMax$Band<-"Sepultura No Max"
  SepNoMax$Max<-"No Max"
  soulfly<-c(rep(1,8), rep(2, 9), rep(3, 4), rep(4, 16), rep(5, 89))
  primitive<-c(rep(1,11), rep(2, 5), rep(3, 5), rep(4, 19), rep(5, 53))
  three<-c(rep(1,1), rep(2, 10), rep(3, 12), rep(4, 7), rep(5, 19))
  prophecy<-c(rep(1,2), rep(2, 5), rep(3, 5), rep(4, 25), rep(5, 42))
  darkages<-c(rep(1,1), rep(2, 1), rep(3, 5), rep(4, 18), rep(5, 36))
  conquer<-c(rep(1,1), rep(2, 0), rep(3, 5), rep(4, 5), rep(5, 31))
  omen<-c(rep(1,0), rep(2, 2), rep(3, 1), rep(4, 6), rep(5, 17))
  enslaved<-c(rep(1,1), rep(2,1), rep(3, 4), rep(4, 2), rep(5, 30))
  savages<-c(rep(1,0), rep(2, 2), rep(3, 3), rep(4, 10), rep(5, 27))
  archangel<-c(rep(1,3), rep(2, 2), rep(3, 4), rep(4, 7), rep(5, 21))

  soulfly<-data.frame(rep("Soulfly", length(soulfly)), rep(1998, length(soulfly)), soulfly)
  primitive<-data.frame(rep("Primitive", length(primitive)), rep(2000, length(primitive)), primitive)
  three<-data.frame(rep("Three", length(three)), rep(2002, length(three)), three)
  prophecy<-data.frame(rep("Prophecy", length(prophecy)), rep(2004, length(prophecy)), prophecy)
  darkages<-data.frame(rep("Darkages", length(darkages)), rep(2005, length(darkages)), darkages)
  conquer<-data.frame(rep("Conquer", length(conquer)), rep(2008, length(conquer)), conquer)
  omen<-data.frame(rep("Omen", length(omen)), rep(2010, length(omen)), omen)
  enslaved<-data.frame(rep("Enslaved", length(enslaved)), rep(2012, length(enslaved)), enslaved)
  savages<-data.frame(rep("Savages", length(savages)), rep(2013, length(savages)), savages)
  archangel<-data.frame(rep("Archangel", length(archangel)), rep(2015, length(archangel)), archangel)
  names(soulfly)<-c("Album", "Year", "Rating")
  names(primitive)<-c("Album", "Year", "Rating")
  names(three)<-c("Album", "Year", "Rating")
  names(prophecy)<-c("Album", "Year", "Rating")
  names(darkages)<-c("Album", "Year", "Rating")
  names(conquer)<-c("Album", "Year", "Rating")
  names(omen)<-c("Album", "Year", "Rating")
  names(enslaved)<-c("Album", "Year", "Rating")
  names(savages)<-c("Album", "Year", "Rating")
  names(archangel)<-c("Album", "Year", "Rating")

  Soulfly<-rbind(soulfly, primitive, three, prophecy, darkages, conquer, omen, enslaved, savages, archangel)
  maxvsnomax<-rbind(SepMax, SepNoMax, Soulfly)
  sepvssoul<-subset(maxvsnomax, Band != "Sepultura Max")

Graphs in this post


m <- ggplot(sepvssoul, aes(x = Rating, fill = Band, colour = Band)) 
  m + geom_histogram(binwidth = 1, alpha = .2, position=”identity”) + coord_cartesian(xlim = c(0, 6)) + scale_y_continuous(breaks = seq(0, 400, 50)) + labs(x = “Rating”, y = “Frequency”, colour = “Band”, fill = “Band”) + theme_bw()
  line <- ggplot(sepvssoul,  aes(Year, Rating, colour = Band))
  line + stat_summary(fun.y = mean, geom = “point”) + stat_summary( = mean_cl_boot, geom = “errorbar”, width = 0.2) + labs(x = “Year”, y = “Mean Rating on”) + coord_cartesian(ylim = c(1, 5)) + scale_x_continuous(breaks=seq(1998, 2015, 1))+ theme_bw() + stat_summary(fun.y = mean, geom = “line”, aes(group=Band))

To retract or to not retract, that is the question …

Amazingly I haven’t written a blog since September last year, it’s almost as though I have better things to do (or have no ideas about topics … or have nothing interesting to say). It’s more the lack of ideas/interesting things to say, oh, and other people are so much better at statistics blogging than I am (see Dan Lakens for example). Still, twitter provided me with inspiration this morning as reports filtered through of the latest in a long line of Psychological Science retractions. This particular one is for an article by Fisher et al. (2015) in which they showed (according to retraction watch) that ‘women prefer to wear makeup when there is more testosterone present in their saliva’.  The full details of the retraction are also helpfully described by retraction watch if you’re interested. The main reason for the retraction though as described by the authors was as follows (see here):

“Our article reported linear mixed models showing interactive effects of testosterone level and perceived makeup attractiveness on women’s makeup preferences. These models did not include random slopes for the term perceived makeup attractiveness, and we have now learned that the Type 1 error rate can be inflated when by-subject random slopes are not included (Barr, Levy, Scheepers, & Tily, 2013). Because the interactions were not significant in reanalyses that addressed this issue, we are retracting this article from the journal.”

The purpose of this blog is to explain why I believe (other things being equal) the authors should have published a correction and not retracted the article. Much of what I think isn’t specific to this example, it just happens to have been triggered by it.

To retract …

I assume that the authors’ decision to retract is motivated by a desire to rid the world of false knowledge. By retracting, the original paper is removed from the universe thus reducing the risk of ‘false knowledge’ on this topic spreading. A correction would not minimise the risk that the original article was cited or followed up by other researchers unless it was sort of tagged onto the end of the paper. If a correction appears as a separate paper then it may well be overlooked. However, I think this is largely a pragmatic issue for the publishers to sort out: just make it impossible for someone to get the original paper without also getting the correction. Job done.

To not retract …

If you read the full account of the retraction, the authors fitted a model, published the details of that model in the supplementary information with the paper and then posted their data the Open Science Framework for others to use. They have been completely transparent. Someone else re-analysed the data and included the aforementioned random slope, and alerted the authors to the differences in the model (notably this crucial interaction term). The authors retracted the paper. I would argue that a correction would be better for the following reasons.

Undeserved repetitional damage

One of the things that really bugs me about science these days (especially psychology) is the witch-hunt-y-ness of it (yes, that’s not a real word). Scientists happily going about their business with good intentions, make bad decisions and suddenly everyone is after their head. This is evidenced by the editor feeling the need to make this comment in the retraction: “I would like to add an explicit statement that there is every indication that this retraction is entirely due to an honest mistake on the part of the authors.” The editor is attempting damage limitation for the authors.
The trouble is that retractions come with baggage ranging from ‘the scientists don’t know what they’re doing’ (at best) to hints that they have deliberately misled everyone for their own gain. This baggage is unnecessary. Don’t get me wrong, I’ve seen people do terrible things with data (in the vast majority of cases out of ignorance, not deceit) and I’m convinced that the incentive structures in academia are all wrong (quantity is valued over quality), but deep down I still like to believe that scientists care about science/knowledge. Given how open they have been with their analysis and data, these scientists strike me as being people who care about science They are to be applauded for their openness, and not burdened with the baggage of retraction. A correction would have better reflected their honesty and integrity.

Retraction implies there is one correct way to model the data

Retracting the paper implies ‘we did it wrong’. Did the authors analyse their data incorrectly though? Here’s some food for thought. Raphael Silberzahn and colleagues published a paper in which they gave the same research question and the same data set to 29 research teams and examined how they addressed the question (there is an overview of the paper here, and the paper itself is available here). Essentially they found a lot of variability in what statistical models were applied to answer the question including tobit regression, logistic regression (sometimes multilevel, sometimes not), poisson regression (sometimes multilevel, sometimes not), spearman’s correlation, OLS regression, WLS regression, Bayesian logistic regression (sometimes multilevel, sometimes not). You get the gist. The resulting odds ratios for the effect ranged from 0.89 to 2.93 (although all but 2 were > 1). Confidence intervals for these odds ratios ranged in width quite widely. The positive thing was that if you look at Figure 1 in the paper, despite variation in the models applied, there was a fair bit of consensus in the odds ratio and confidence intervals produced (about half of the point estimates/CIs  – the ones from team 26 to team 9 – line up pretty well despite the varying models applied). However, it goes to show that give a data set and a question to 29 research teams and they will analyse it differently. Is there one correct model? Are 28 teams wrong and 1 team correct. No, data analysis is always about decisions, and although there can be unequivocally wrong decisions, there is rarely only one correct decision.

So, Fisher and colleagues didn’t include a random slope, someone else did. This change in model specification affected the model parameters and p-values. Is the inclusion of the random slope any more correct than it’s exclusion? That’s somewhat a matter of opinion. Of course, its exclusion could have led to a Type I error (if you fixate on p-values), but the more interesting question is why it changes the model, how it changes it and what the implications are moving forward. The message (for me) from the Silberzahn paper is that if any of us let other scientists loose with our data, they would probably do different things with it that would affect the model parameters and p-values. Just as has happened here. The logic of this particular retraction is that every scientist should retract every paper they have ever published on the grounds that there were probably other models that could have been fit, and if they had been then the parameter estimates in the paper would be different. A correction (rather than retraction) would have allowed readers and researcher in this field to consider the findings in the light of the difference that the random slope makes to the model.

Retraction devalues the original study

Here’s how science works. People generate theories, they transform them into testable hypotheses, they collect data, they evaluate the evidence for their hypothesis. Then other people get interested in the same theory and collect more data and this adds to the collective knowledge on the topic. Sometimes the new data contradicts the old data, in which case people update their beliefs. We do not, however, retract all of the old papers because this new one has thrown up different evidence. That would be silly, and yet I think that is all that has happened here with Fisher et al.’s paper. They fitted one model to the data, drew some conclusions, then someone else moved forward with the data and found something different. Retraction implies that the original study was of no value whatsoever, it must be hidden away never to be seen. Regardless of how you analyse the data, if the study was methodologically sound (I don’t know if it was – I can’t access it because its been retracted) then it adds value to the research question irrespective of the significance of an interaction in the model. A retraction removes this knowledge from the world, it becomes a file drawer paper rather than information that is out in the open. We are deprived of the evidence within the paper (including how that evidence changes depending on what model you fit to the data). A correction allows this evidence to remain public, and better still updates that evidence in the light of new analysis in useful ways …

Retraction provides the least useful information about the research question

By retracting this study we are none the wiser about the hypothesis. All we know is that a p-value that was below < .05 flipped to the other side of that arbitrary threshold when a random slope was included in the model. It could have changed from .049 to .051 for all we know, in which case the associated parameter has most likely not really changed much at all. It might have changed from .00000000001 to .99, in which case the impact has been more dramatic. A retraction deprives us of this information. In a correction, the authors could present the new model, its parameters and confidence intervals (incidentally, on the topic of which I recommend Richard Morey’s recent paper) and we could see how things have changed as a result of including the random slope. A correction provides us with specific and detailed evidence with which we can update our beliefs from the original paper. A correction allows the reader to determine what they believe. A retraction provides minimal and (I’d argue) unhelpful information about how the model changed, and about how to update our beliefs about the research question. All we are left with is to throw the baby out with the bathwater and pretend, like the authors, that the study never happened. If the methods were sound, then the study is valuable, and the new analysis is not damning but simply sheds new light on the hypotheses being tested. A retraction tells us little of any use.

Where to go …

What this example highlights to me is how science needs to change, and how the publication process also needs to change. Science moves forward through debate, through people challenging ideas, and this is a good example. If the paper were in a PLoS style journal that encourages debate/comment then a retraction would not have been necessary. Instead, the models and conclusions could have been updated for all to see, the authors could have updated their conclusions based on these new analyses, and knowledge on the topic would have advanced. You’d end up with a healthy debate instead of evidence being buried. One of the challenges of open science and the OSF is to convince scientists that by making their data public they are not going to end up in these situations where they are being witch hunted, or pressured into retractions. Instead, we need to embrace systems that allow us to present different angles on the same data, to debate conclusions, and to strive for truth by looking at the same data from different perspectives … and for none of that to be perceived as a bad thing. Science will be better for it.


Fisher, C. I., Hahn, A. C., DeBruine, L. M., & Jones, B. C. (2015). Women’s preference for attractive makeup tracks changes in their salivary testosterone. Psychological Science, 26, 1958–1964. doi:10.1177/0956797615609900

The referee’s a …..

I’m a bit late on this particular bandwagon, but it’s a busy time what with the start of term and finishing writing  textbook and all that. The textbook is also my excuse for having not written a blog for a year, well, that and the fact I rarely have anything interesting to say.

Anyway, those of you who follow football (or soccer for our US friends) will know the flack that referees get. I am a regular at the stadium of the team I’ve supported* since the age of 3 and like most football stadiums I regularly hear the chant of ‘The referee’s a wanker’ echoing around the ground. Referees have a tough job: the game is fast, often faster than they are, players can be a bit cheaty, and we have the benefit of TV replays which they (for some utterly inexplicable reason) do not. Nevertheless it’s annoying when they get stuff wrong.

The more specific referee-related thing I often hear resonating around the particular stadium that I attend is ‘Oh dear me, Mike Dean is refereeing … he hates us quite passionately’ or something similar with a lot more swearing. This particular piece of folk wisdom reached dizzy new heights the weekend before last when he managed to ignore Diego Costa trying to superglue his hands to Laurent Koscielny’s face shortly before chest bumping him to the floor, and then sending Gabriel off for … well, I’m not really sure what. Indeed, the FA weren’t really sure what for either because they rescinded the red card and banned Costa for 3 matches. You can read the details here.

So, does Mike Dean really hate Arsenal? In another blog 7amkickoff tried to answer this question using data collated about Mike Dean. You can read it here. This blog has inspired me to go one step further and try to answer this question as a scientist would.

How does a scientist decide if Mike Dean really is a wanker?

Scientists try to work with scientific statements. ‘Mike Dean is a wanker’ may be the folklore at the Emirates, but it is not a scientific statement because we probably can’t agree on a working definition of what a wanker is (at least not in the refereeing sense). However, we can turn it into a scientific statement by changing the question to something like ‘Do arsenal lose more games when Mike Dean (MD) referees?’ This is what 7amkickoff tried to do. 7amkickoff presented data from 2009-2015 and concluded that ‘Arsenal’s win% in the Premier League since 2009 is 57% so a 24% win rate is quite anomalous’ (the 24% is the win rate under Dean). There are several problems with this simplistic view, a couple of them are:
  1. MD tends to referee Arsenal’s tougher games (opponents such as Chelsea, Manchester United and Manchester City), so we need to compare Arsenal’s win rate under MD to our win rate against only the same opponents as in the MD games. (In case you’re interested the full list of clubs is Birmingham, Blackburn, Burnley, C Palace, Charlton, Chelsea, Fulham, Man City, Man Utd, Newcastle, QPR, Stoke City, Tottenham, Watford, West Brom, West Ham, Wigan). If we compare against a broader set of opponents then it isn’t a fair comparison to MD: we are not comparing like for like.
  2.  How do we know that a win rate of 24% under MD isn’t statistically comparable to a win rate of 57%. They seem different, but couldn’t such a difference simply happen because of the usual random fluctuations in results? Or indeed because Arsenal are just bad at winning against ‘bigger’ teams and MD happens to officiate those games (see the point above)
I’m going to use the same database as 7amkickoff and extend this analysis. I collected data from for the past 10 years (2005-6 up to the Chelsea fixture on 19th Sept 2015). I am only looking at English Premier League (EPL) games. I have filtered the data to include only the opponents mentioned above, so we’ll get a fair comparison of Arsenal’s results under MD against their results against those same clubs when MD is not officiating. The data are here as a CSV file.
A scientist essentially sets up two competing hypotheses. The first is that there is no effect of MD, that is Arsenal’s results under MD are exactly the same as under other referees. This is known as the null hypothesis. The second is the opposite: that there is an effect of MD, that is Arsenal’s results under MD are different compared to other referees. This is known as the alternative hypothesis. We can use statistical tests to compare these hypotheses.

A frequentist approach

To grab the file into R run:
arse<-read.csv(file.choose(), header = T)
This creates a data frame called arse. 
If we look at the results of all EPL arsenal fixtures against the aforementioned teams since 2005, we get this, by running the following code in R: <- xtabs(~MD+result, data=arse)
       Draw Lose Win
Other   33   43  115
MD      14   11   9
One way to test whether the profile of results is different for MD is to see whether there is a significant relationship between the variable ‘MD vs Other’ and ‘result’. This asks the question of whether the profile of draws to wins and loses is statistically different for MD than ‘other’. If we assume that games are independent events (which is a simplifying assumption because they won’t be) we can use a chi-square test.
You can fit this model in R by running:
CrossTable(, fisher = TRUE, chisq = TRUE, sresid = TRUE, format = “SPSS”)

Total Observations in Table:  225 

             | result 
          MD |     Draw  |     Lose  |      Win  | Row Total | 
           0 |       33  |       43  |      115  |      191  | 
             |    1.193  |    0.176  |    0.901  |           | 
             |   17.277% |   22.513% |   60.209% |   84.889% | 
             |   70.213% |   79.630% |   92.742% |           | 
             |   14.667% |   19.111% |   51.111% |           | 
             |   -1.092  |   -0.419  |    0.949  |           | 
           1 |       14  |       11  |        9  |       34  | 
             |    6.699  |    0.988  |    5.061  |           | 
             |   41.176% |   32.353% |   26.471% |   15.111% | 
             |   29.787% |   20.370% |    7.258% |           | 
             |    6.222% |    4.889% |    4.000% |           | 
             |    2.588  |    0.994  |   -2.250  |           | 
Column Total |       47  |       54  |      124  |      225  | 
             |   20.889% |   24.000% |   55.111% |           | 

Statistics for All Table Factors

Pearson’s Chi-squared test 
Chi^2 =  15.01757     d.f. =  2     p =  0.0005482477 

Fisher’s Exact Test for Count Data
Alternative hypothesis: two.sided
p =  0.0005303779 

       Minimum expected frequency: 7.102222 

The win rate under MD is 26.47% compared to 41.18% draws and 32.35% losses, under other referees it is 60.21%, 17.28% and 22.51% respectively. These are the comparable values to 7amkickoff but looking at 10 years and including only opponents that feature in MD games. The critical part of the output is the p = .0005. This is the probability that we would get the chi-square value we have IF the null hypothesis were true. In other words, if MD had no effect whatsoever on Arsenal’s results the probability of getting a test result of 15.12 would be 0.000548. In other words, very small indeed. Scientists do this sort of thing all of the time, and generally accept that if p is less than .05 then this supports the alternative hypothesis. That is we can assume it’s true. (In actual fact, although scientists do this they shouldn’t because this probability value tells us nothing about either hypothesis, but that’s another issue …) Therefore, by conventional scientific method, we would accept that the profile of results under MD is significantly different than under other referees. Looking at the values in the cells of the table, we can actually see that the profile is significantly worse under MD than other referees in comparable games. Statistically speaking, MD is a wanker (if you happen to support Arsenal).

As I said though, we made simplifying assumptions. Let’s look at the raw results, and this time factor in the fact that results with the same opponents will be more similar than results involving different opponents. That is, we can assume that Arsenal’s results against Chelsea will be more similar to each other than they will be to results agains a different club like West Ham. What we do here is nest results within opponents. In doing so, we statistically model the fact that results against the same opposition will be similar to each other. This is known as a logistic multilevel model. What I am doing is predicting the outcome of winning the game (1 = win, 0 = not win) from whether MD refereed (1 = MD, 0 = other). I have fitted a random intercepts model, which says overall results will vary across different opponents, but also a random slopes model which entertains the possibility that the effect of MD might vary by opponent. The R code is this:

win.ri<-glmer(win ~ 1 + (1|Opponent), data = arse, family = binomial, na.action = na.exclude)<-update(win.ri, .~. + MD)<-update(win.ri, .~ MD + (MD|Opponent))

The summary of models shows that the one with the best fit is random intercepts and MD as a predictor. I can tell this because the model with MD as a predictor significantly improves the fit of the model (the p of .0024 is very small) but adding in the random slopes component does not improve the fit of the model hardly at all (because the p of .992 is very close to 1).

Data: arse
win.ri: win ~ 1 + (1 | Opponent) win ~ (1 | Opponent) + MD win ~ MD + (MD | Opponent)
       Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)   
win.ri  2 302.41 309.24 -149.20   298.41                     3 295.22 305.47 -144.61   289.22 9.1901      1   0.002433 **  5 299.20 316.28 -144.60   289.20 0.0153      2   0.992396  

If we look at the best fitting model ( by running:


We get this:

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [‘glmerMod’]
 Family: binomial  ( logit )
Formula: win ~ (1 | Opponent) + MD
   Data: arse

     AIC      BIC   logLik deviance df.resid 
   295.2    305.5   -144.6    289.2      222 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.6052 -0.7922  0.6046  0.7399  2.4786 

Random effects:
 Groups   Name        Variance Std.Dev.
 Opponent (Intercept) 0.4614   0.6792  
Number of obs: 225, groups:  Opponent, 17

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   0.5871     0.2512   2.337  0.01943 * 
MDMD         -1.2896     0.4427  -2.913  0.00358 **

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
MDMD -0.225

The important thing is whether the variable MD (which represents MD vs. other referees) is a ‘significant predictor’. Looking at the p-value of .00358,we can assess the probability that we would get an estimate as large as -1.2896 if the null hypothesis were true (i.e. if MD had no effect at all on Arsenal’s results). Again, because this value is less than .05, scientists would typically conclude that MD is a significant predictor of whether Arsenal win or not, even factoring in dependency between results when the opponent is the same. The value of the estimate (-1.2896) tells us about the strength and direction of the relationship and because our predictor variable is MD = 1, other  = 0, and our outcome is win = 1 and no win = 0, and the value is negative it means that as MD increases (in other words as we move from having ‘other’ as a referee to having MD as a referee) the outcome decreases (that is, the probability of winning decreases). So, this analysis shows that MD significantly decreases the probability of Arsenal winning. The model is more complex but the conclusion is the same as the previous, simpler, model: statistically speaking, MD is a wanker (if you happen to support Arsenal).

A Bayesian Approach

The frequentist approach above isn’t universally accepted because these probability values that I keep mentioning don’t allow us to test directly the plausibility of the null and alternative hypothesis. A different approach is to use a Bayes Factor. Bayes Factors quantify the ratio of the probability of the data under the alternative hypothesis relative to the null. A value of 1, therefore, means that the observed data are equally probable under the null and alternative hypotheses, values below 1 suggest that the data are more probable under the null hypotheses relative to the alternative. Bayes factors provide information about the probability of the data under the null hypothesis (relative to the alternative) whereas significance tests (as above) provide no evidence at all about the status of the null hypothesis. Bayes Factors are pretty easy to do using Richard Morey’s excellent BayesFactor  package in R (Morey & Rouder, 2014). There’s a lot more technical stuff to consider, which I won’t get into … for example, to do this properly we really need to set some prior believe in our hypothesis. However, Morey’s packages uses some default priors that represent relatively open minded beliefs (about MD in this case!).
To get a Bayes Factor for our original contingency table you’d run:
bayesMD = contingencyTableBF(, sampleType = “indepMulti”, fixedMargin = “cols”)
and get this output:
Bayes factor analysis
[1] Non-indep. (a=1) : 34.04903 ±0%

Against denominator:
  Null, independence, a = 1 
Bayes factor type: BFcontingencyTable, independent multinomial

The resulting Bayes Factor of 34.05 is a lot bigger than 1. The fact it is bigger than 1 suggests that the probability of the data under the alternative hypothesis is 34 times greater then the probability of the data under the null. In other words, we should shift our prior beliefs towards the idea the MD is a wanker by a factor of about 34. Yet again, statistically speaking, MD is a wanker (if you happen to support Arsenal).


The aim of this blog is to take a tongue in cheek look at how we can apply statistics to a real-world question that vexes a great many people. (OK, the specific Mike Dean question only vexes Arsenal fans, but football fans worldwide are vexed by referees on a regular basis). With some publicly available data you can test these hypotheses, and reach conclusions based on data rather than opinion. That’s the beauty of statistical models.
I’ll admit that to the casual reader this blog is going to be a lot more tedious than 7amkickoff’s. However, what I lack in readability I gain in applying some proper stats to the data. Whatever way you look at it, Mike Dean, statistically speaking does predict poorer outcomes for Arsenal. There could be any number of other explanations for these data, but you’d think that the FA might want to have a look at that. Arsenal’s results under Mike Dean are significant;y different to under other referees, and these models show that this is more than simply the random fluctuations in results that happen with any sports team.

On a practical note, next time I’m at the emirates and someone says ‘Oh shit, it’s Mike Dean, he hates us …’ I will be able to tell him or her confidently that statistically speaking they have a point … well, at least in the sense that Arsenal are significantly more likely to lose when he referees!

*Incidentally, I’m very much of the opinion that people are perfectly entitled to support a different team to me, and that the football team you support isn’t a valid basis for any kind of negativity. If we all supported the same team it would be dull, and if everyone supported ‘my’ team, it’d be even harder for me to get tickets to games. So, you know, let’s be accepting of our differences and all that ….

Perhaps my oxytocin was low when I read this paper.

I’m writing a new textbook on introductory statistics, and I decided to include an example based on Paul Zak‘s intriguing work on the role of the hormone oxytocin in trust between strangers. In particular, I started looking at his 2005 paper in Nature. Yes, Nature, one of the top science journals. A journal with an impact factor of about 38, and a rejection rate probably fairly close to 100%. It’s a journal you’d expect published pretty decent stuff and subjects articles to fairly rigorous scrutiny.

Before I begin, I have no axe to grind here with anyone. I just want to comment on some stuff I found and leave everyone else to do what they like with that information. I have no doubt Dr. Zak has done a lot of other work on which he bases his theory, I’m not trying to undermine that work, I’m more wanting to make a point about the state of the top journals in science. All my code here is for R.

The data I was looking at relates to an experiment in which participants were asked to invest money in a trustee (a stranger). If they invested, then the total funds for the investor and trustee increased. If the trustee shares the proceeds then both players end up better off, but if the trustee does not repay the investors’ trust by splitting the fund then the trustee ends up better off and the investor worse off.  The question is, will investors trust the trustees to split the funds? If they do then they will invest, if not they won’t. Critically, one group of investors were exposed to exogenous oxytocin (N = 29), whereas a placebo group were not (N = 29). The oxytocin group invested significantly more than the placebo control group suggesting that oxytocin had causally influenced their levels of trust in the trustee. This is the sort of finding that the media loves.

The paper reports a few experiments, I want to focus on the data specifically related to trust shown in Figure 2a (reproduced below):

 The good thing about this figure is that it shows relative frequencies, which means that with the aid of a ruler and a spreadsheet we can reconstruct the data. Based on the figure the raw data is as follows:

placebo<-c(3, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 9, 9, 11, 11, 11, 12, 12, 12, 12, 12, 12)

oxy<-c(3, 4, 4, 6, 6, 7, 8, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)

The problem being that this gives us only N = 26 in the placebo group. Bollocks!

Ok, well perhaps they reported the wrong N. Here’s their table of descriptives:

Let’s have a look at the descriptives I get (you’ll need the pastecs package):

> round(stat.desc(placebo, basic = F), 1)
      median         mean      SE.mean CI.mean.0.95          var     coef.var 
         7.5          7.9          0.6          1.3         10.2          3.2          0.4 
> round(stat.desc(oxy, basic = F), 1)
      median         mean      SE.mean CI.mean.0.95          var     coef.var 
        10.0          9.6          0.5          1.1          8.1          2.8          0.3 
For the oxytocin group the mean, median and SD match, but for the placebo group they don’t. Hmmm. So, there must be missing cases. Based on where the median is, I guessed that the three missing cases might be values of 10. In other words, Figure 2a in the paper, ought to look like this:
So, the placebo group data now looks like this:

placebo<-c(3, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 12, 12, 12, 12)

Let’s look at the descriptives now:
> round(stat.desc(placebo, basic = F), 1)
      median         mean      SE.mean CI.mean.0.95          var     coef.var 
         8.0          8.1          0.6          1.2          9.5          3.1          0.4 
They now match the table in the paper. So, this gives me confidence that I have probably correctly reconstructed the raw data despite Figure 2’s best efforts to throw me off of the scent. Of course, I could be wrong. Let’s see if my figures match their analysis. They did a Mann-Whitney U that yielded a p-value of 0.05887, which they halved to report one-tailed as .029. Notwithstanding the fact that you probably shouldn’t do one-tailed tests, ever, they conclude a significant between group difference. I replicate their p-value, again convincing me that my reconstructed data matches their raw data.
Put data into a data frame and do wilcoxon rank sum test (which is equivalent to Mann-Whitney):
zak<-data.frame(gl(2, 29), c(placebo, oxy))
names(zak)<-c("Group", "MU")
zak$Group<-factor(zak$Group, labels = c("Placebo", "Oxytocin"))

> wilcox.test(MU~Group, zak)

Wilcoxon rank sum test with continuity correction

data:  MU by Group
W = 301, p-value = 0.05887
alternative hypothesis: true location shift is not equal to 0
So, all well and good. There are better ways to look at this than a Mann-Whitney test though, so let’s use some of Rand Wilcox’s robust tests. First off, a trimmed mean and bootstrap:

> yuenbt(placebo, oxy, tr=.2, alpha=.05, nboot = 2000, side = T)
[1] -3.84344384  0.05397015

[1] -1.773126

[1] 0.056

[1] 8.315789

[1] 10.21053

[1] -1.894737

[1] 29

[1] 29
Gives us a similar p-value to the Mann-Whitney and a confidence interval for the difference that contains zero. However, the data are very skewed indeed:
Perhaps we’re better off comparing the medians of the two groups. (Medians are no-where near as biased by skew as means):

> pb2gen(placebo, oxy, alpha=.05, nboot=2000, est=median)
[1] “Taking bootstrap samples. Please wait.”
[1] 8

[1] 10

[1] -2

[1] -6  1

[1] 0.179

[1] 2.688902

[1] 29

[1] 29
The p-value is now a larger .179 (which even if you decide to halve it won’t get you below the, not very magic, .05 threshold for significance). So, the means might be significantly different depending on your view of one-tailed tests, but the medians certainly are not. Of course null hypothesis significance testing is bollocks anyway (see my book, or this blog) so maybe we should just look at the effect size. Or maybe we shouldn’t because the group means and SDs will be biased by the skew (SDs especially because you square the initial bias to compute it). Cohen’s d is based on means and SDs so if these statistics are biased then d will be too. I did it anyway though, just for a laugh:
> d<-(mean(oxy)-mean(placebo))/sd(placebo)
> d
[1] 0.4591715
I couldn’t be bothered to do the pooled estimate variant, but because there is a control group in this study it makes sense to use the control group SD to standardise the mean difference (because the SD in this condition shouldn’t be affected my the experimental manipulation). The result? About half a standard deviation difference (notwithstanding the bias). That’s not too shabby, although it’d be even more un-shabby if the estimate wasn’t biased. Finally, and notwithstanding various objections to using uninformed priors in Bayesian analysis) we can use the BayesFactor packaged to work out a Bayes Factor
> ttestBF(oxy, y = placebo)
Bayes factor analysis
[1] Alt., r=0.707 : 1.037284 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
Bayes factor type: BFindepSample, JZS
The Bayes factor is 1.03, which means that there is almost exactly equal evidence for the null hypothesis as the alternative hypothesis. In other words, there is no support for the hypothesis that oxytocin affected trust.
So, Nature, one of the top journals in science, published a paper where the evidence for oxytocin affecting trust was debatable at best. Let’s not even get into the fact that this was based on N = 58. Like I said, I’m sure Dr. Zak has done many other studies and I have no interest in trying to undermine what he or his colleagues do, I just want to comment on this isolated piece of research and offer an alternative interpretation of the data. There’s lots of stuff I’ve done myself that with the benefit of experience I’d rather forget about, so, hey, I’m not standing on any kind of scientific high ground here. What I would say is that in this particular study, based on the things least affected by the shape of the data (medians) and a (admittedly using uninformed priors) Bayesian analysis there is not really a lot of evidence that oxytocin affected trust.

Bat Cunnilingus and Spurious Results

Thanks to my brother I became aware of this article about bat cunnilingus today. My brother, knowing how entertained I had been by a bat felating itself at Disney on my honeymoon, decided that this would be just the sort of article that I’d like. Knowing the kind of examples I put in my textbooks I suspect his email is the first of many. Anyway, this article suggests that male bats engage in cunnilingus with females. Female bats, that is. They have videos to prove it. Male bats like cunnilingus so much that they don’t just use it as foreplay, but they get stuck in afterwards as well. The article has two main findings:
  1. The more that males (bats) engage in cunnilingus, the longer the duration of intercourse, r = .91 (see their Figure1). They tested this with a Pearson r, based on 8 observations (although each observation was the average of several data points I think – it’s not entirely clear how the authors came about these averages). They conclude that this behaviour may be due to sperm competition in that males sort of ‘clean out’ the area before they procreate. They argue that the longer intercourse duration supports this argument because longer duration = greater chance of paternity.
  2.  The more that males engage in pre-intercourse cunnilingus, the less they engage in post-intercourse cunnilingus, r = .79 (see their Figure2). They again tested this with a Pearson r, based on 8 observations. They conclude “… the negative relationship between the duration of pre- and post-copulatory cunnilingus suggests the possibility of sperm competition. Thus, the male removes the sperm of competitors but apparently not its own – lick for longer before but less after copulation, if paternity uncertainty is high.”

There are several leaps of faith in their conclusions. Does longer duration necessarily imply greater chance of paternity? Also, if sperm competition is at the root of all of this then I’m not sure post coital cunilingus is really ever an adaptive strategy: consuming your own sperm hardly seems like the best method of impregnating your mate. Yes, I’m still talking about bats here, what you choose to do in your own free time is none of my business.

Enough about Cunnilingus, what about Statistics?

Anyway, I want to make a statistical point, not an evolutionary one. It struck me when looking at this article that finding 2 is probably spurious. Figure 2 looks very much like two outliers are making a relationship out of nothing. With only 8 observations they are likely to have a large impact. With 8 observations it is, of course, hard to say what’s an outlier and what isn’t; however, the figure was like a bat tongue to my brain: my juices were flowing. Fortunately, with so few observations we can approximate the data from the two figures. I emphasises that these are quick estimates of values based on the graphs and a ruler. We have three variables estimated from the paper: pre (pre-intercourse cunnilingus), post (post-intercourse cunnilingus), duration (duration of intercourse).
We’re going to use R, if you don’t know how to use it then some imbecile called AndyField has written a book on it that you could look at. We can input these approximate scores as:
pre<-c(46, 50, 51, 52, 53, 53.5, 55, 59)
post<-c(162, 140, 150, 140, 142, 138, 152, 122)
duration<-c(14.4, 14.4, 14.9, 15.5, 15.4, 15.2, 15.3, 16.4)
bat<-data.frame(pre, post, duration)
Ok, so that’s the three variables entered, and I’ve made a data frame from them called bat.
Let’s load some packages that will be useful (if you don’t have them installed then install them) and source Rand Wilcox’s very helpful robust functions:
Let’s draw some rough replicas of Figures 1 and 2 from the paper:
qplot(pre, post, date = bat, ylim = c(85, 200), xlim = c(44, 60)) + stat_smooth(method = “lm”)
qplot(pre, duration, date = bat, ylim = c(13, 18), xlim = c(44, 60))+ stat_smooth(method = “lm”)

Figure 1: Quick replica of Maruthupandian & Marimuthu (2013) Figure 1
Figure 1: Quick replica of Maruthupandian & Marimuthu (2013) Figure 2
As you can see the figures – more or less – replicate their data. OK, let’s replicate their correlation coefficient estimates by executing:
rcorr(as.matrix(bat), type = “pearson”)
gives us
           pre  post duration
pre       1.00 -0.78     0.91
post     -0.78  1.00    -0.75
duration  0.91 -0.75     1.00
n= 8

         pre    post   duration
pre             0.0225 0.0018 
post     0.0225        0.0308 
duration 0.0018 0.0308
So, we replicate the pre-intercourse cunnilingus (Pre) – duration correlation of .91 exactly (p= .0018), and we more or less get the right value for the pre-post correlation: we get r = .78 (p = .0225) instead of -.79, but we’re within rounding error. So, it looks like our estimates of the raw data from the Figures in the article aren’t far off the real values at all. Just goes to show how useful graphs in papers can be for extracting the raw data.
If I’m right and the data in Figure 2 are affected by the first and last case, we’d expect something different to happen if we calculate the correlation based on ranked scores (i.e. Spearman’s rho). Let’s try it:
rcorr(as.matrix(bat), type = “spearman”)
           pre  post duration
pre       1.00 -0.50     0.78
post     -0.50  1.00    -0.51
duration  0.78 -0.51     1.00
n= 8
         pre    post   duration
pre             0.2039 0.0229 
post     0.2039        0.1945 
duration 0.0229 0.1945 
The Pre-duration correlation has dropped from .91 to .78 (p = .0229) but it’s still very strong. However, the pre-post correlation has changed fairly dramatically from -.79 to r = .50 (p = .204) and is now not significant (although I wouldn’t want to read to much into that with an n = 8).
What about if we try a robust correlation such as the percentile bend correlation described by <!–[if supportFields]> ADDIN EN.CITE Wilcox2012684Wilcox (2012)6846846Wilcox, R. R.Introduction to robust estimation and hypothesis testing3rd2012Burlington, MAElsevier<![endif]–>Wilcox (2012)<!–[if supportFields]><![endif]–>? Having sourced Wilcox’s functions we can do this easily enough by executing:
pbcor(bat$pre, bat$post)
pbcor(bat$pre, bat$duration)
For the pre post correlation we get:
[1] -0.3686576
[1] -0.9714467
[1] 0.368843
[1] 8
It has changed even more dramatically from -.79 to r = .37 (p = .369) and is now not significant (again I wouldn’t want to read to much into that with an n = 8, but it is now a much smaller effect than they’re reporting in the paper).
The pre-duration correlation fares much better:
[1] 0.852195
[1] 3.989575
[1] 0.007204076
[1] 8
This is still very strong and significant, r = .85 (p = .007). We could also try to estimate the population value of these two relationships by computing a robust confidence interval using bootstrapping. We’ll stick with Pearson r seeing as that’s the estimate that they used in the paper but you could try this with the other measures if you like:
bootr<-function(data, i){
cor(data[i, 1],data[i, 2])
bootprepost<-boot(bat[1:2], bootr, 2000)
bootpreduration<-boot(bat[c(1, 3)], bootr, 2000)
If we look at the BCa intervals, we get this for the pre-post correlation:
Level     Percentile            BCa         
95%   (-0.9903,  0.5832 )   (-0.9909,  0.5670 )
and this for the pre-duration correlation:
Level     Percentile            BCa         
95%   ( 0.6042,  0.9843 )   ( 0.5022,  0.9773 )
In essence then, the true relationship between pre- and post-intercourse cunnilingus duration lies somewhere between about –1 and 0.6. In other words, it could be anything really, including zero. For the relationship between pre-intercourse bat cunnilingus and intercourse duration the true relationship is somewhere between r= .5 and .98. In other words, it’s likely to be – at the very worst – a strong and positive association.


What does this tell us? Well, I’m primarily interested in using this as a demonstration of why it’s important to look at your data and to employ robust tests. I’m not trying to knock the research. Of course, in the process I am, inevitably, sort of knocking it. So, I think there are three messages here:
  1. Their first finding that the more that males (bats) engage in cunnilingus, the longer the duration of intercourse is very solid. The population value could be less than the estimate they report, but it could very well be that large too. At the very worst it is still a strong association.
  2. The less good news is that finding 2 (that the more that males engage in pre-intercourse cunnilingus, the less they engage in post-intercourse cunnilingus) is probably wrong. The best case scenario is that the finding is real but the authors have overestimated it. The worst case scenario is that the population effect is in the opposite direction to what the authors report, and there is, of course, a mid ground that there is simply no effect at all in the population (or one so small as to be not remotely important).
  3. It’s also worth pointing out that the two extreme cases that have created this spurious result may in themselves be highly interesting observations. There may be important reasons why individual bats engage in unusually long or short bouts of cunnilingus (either pre or post intercourse) and finding out what factors affect this will be an interesting thing that the authors should follow up.


FAQ #1: K-S Tests in SPSS

I decided to start a series of blogs on questions that I get asked a lot. When I say a series I’m probably raising expectation unfairly: anyone who follows this blog will realise that I’m completely crap at writing blogs. Life gets busy. Sometimes I need to sleep. But only sometimes.

Anyway, I do get asked a lot about why there are two ways to do the Kolmogorov-Smirnov (K-S) test in SPSS. In fact, I got an email only this morning. I knew I’d answered this question many times before, but I couldn’t remember where I might have saved a response. Anyway, I figured if I just blog about it then I’d have a better idea of where I’d written a response. So, here it is. Anyway, notwithstanding my reservations about using the K-S test (you’ll have to wait until edition 4 of the SPSS book), there are three ways to get one from SPSS:

  1. Analyze>explore>plots> normality plots with tests
  2. Nonparametric Tests>One Sample … (or legacy dialogues>one sample KS)
  3. Tickle SPSS under the chin and whisper sweet nothings into its ear
These methods give different results. Why is that? Essentially (I think) if you use method 1 then SPSS applies Lillifor’s correction, but if you use method 2 it doesn’t. If you use method 3 then you just look like a weirdo.
So, is it better to use Lillifor’s correction or not? In the additional website material for my SPSS book, which no-one ever reads (the web material, not the book …) I wrote (self-plaigerism alert):
“If you want to test whether a model is a good fit of your data you can use a goodness-of-fit test (you can read about these in the chapter on categorical data analysis in the book), which has a chi-square test statistic (with the associated distribution). One problem with this test is that it needs a certain sample size to be accurate. The K–S test was developed as a test of whether a distribution of scores matches a hypothesized distribution (Massey, 1951). One good thing about the test is that the distribution of the K–S test statistic does not depend on the hypothesized distribution (in other words, the hypothesized distribution doesn’t have to be a particular distribution). It is also what is known as an exact test, which means that it can be used on small samples. It also appears to have more power to detect deviations from the hypothesized distribution than the chi-square test (Lilliefors, 1967). However, one major limitation of the K–S test is that if location (i.e. the mean) and shape parameters (i.e. the standard deviation) are estimated from the data then the K–S test is very conservative, which means it fails to detect deviations from the distribution of interest (i.e. normal). What Lilliefors did was to adjust the critical values for significance for the K–S test to make it less conservative (Lilliefors, 1967) using Monte Carlo simulations (these new values were about two thirds the size of the standard values). He also reported that this test was more powerful than a standard chi-square test (and obviously the standard K–S test).
Another test you’ll use to test normality is the Shapiro-Wilk test (Shapiro & Wilk, 1965) which was developed specifically to test whether a distribution is normal (whereas the K–S test can be used to test against other distributions than normal). They concluded that their test was ‘comparatively quite sensitive to a wide range of non-normality, even with samples as small as n = 20. It seems to be especially sensitive to asymmetry, long-tailedness and to some degree to short-tailedness.’ (p. 608). To test the power of these tests they applied them to several samples (n = 20) from various non-normal distributions. In each case they took 500 samples which allowed them to see how many times (in 500) the test correctly identified a deviation from normality (this is the power of the test). They show in these simulations (see table 7 in their paper) that the S-W test is considerably more powerful to detect deviations from normality than the K–S test. They verified this general conclusion in a much more extensive set of simulations as well (Shapiro, Wilk, & Chen, 1968).” 
So there you go. More people have probably read that now than when it was on the additional materials for the book. It Looks like Lillifor’s correction is a good thing (power wise) but you probably don’t want to be using K-S tests anyway really, or if you do interpret them within the context of the size of your sample and look at graphical displays of your scores too.

You Can’t Trust Your PhD Supervisor:)

My Ex Ph.D. supervisor Graham Davey posted a blog this morning about 10 ways to create false knowledge in psychology. It’s a tongue-in-cheek look at various things that academics do for various reasons. Two of his 10 points have a statistical theme, and they raise some issues in my mind. I could walk the 3 meters between my office and Graham’s to discuss these, or rib him gently about it next time I see him in the pub, but I thought it would be much more entertaining to write my own blog about it. A blog about a blog, if you will. Perhaps he’ll reply with a blog about a blog about a blog, and I can reply with a blog about a blog about a blog about a blog, and then we can end up in some kind of blog-related negative feedback loop that wipes us both from existence so that we’d never written the blogs in the first place. But that would be a paradox. Anyway, I digress.

Before I continue, let me be clear that during my PhD Graham taught me everything I consider worth knowing about the process of science, theory, psychopathology and academic life. So, I tend to listen to what he says (unless it’s about marriage or statistics) very seriously. The two points I want to focus on are:

2.  Do an experiment but make up or severely massage the data to fit your hypothesis. This is an obvious one, but is something that has surfaced in psychological research a good deal recently (;

Clearly the number of high profile retractions/sackings in recent times suggests that there is a lot of this about (not just in psychology). However, I think there is a more widespread problem than deliberate manipulation of data. For example, I remember reading somewhere about (I think) the Dirk Smeeters case or it might have been the Stapel one (see, I’m very scientific and precise); in any case, the person in question (perhaps it was someone entirely different), had claimed that they hadn’t thought they were doing anything wrong when applying the particular brand of massage therapy that they had applied to their data. So, although there are high profile cases of fraud that have been delved into, I think there is a wider problem of people simply doing the wrong thing with their data because they don’t know any better. I remind you of  Hoekstra, Henk, Kiers and Johnson’s recent study that asked recent postgraduate students about assumptions in their data, this paper showed (sort of) that recent postgraduate researchers don’t seem to check assumptions. I’d be very surprised if it’s just postgraduates. I would bet that assumptions, what they mean, when they matter and what to do about them are all concepts that are poorly understood amongst many very experiences researchers (not just within psychology). My suspicions are largely founded on the fact that I have only relatively recently really started to understand why and when these things matter, and I’m a geek who takes an interest in these things. I also would like to bet that the misunderstandings about assumptions and robustness of tests stem from being taught by people who poorly understand these things. I’m reminded of Haller & Kraus’ (2002) study showing that statistics teachers misunderstood p-values. The fourth edition of my SPSS book (plug: out early 2013) is the first book in which I really feel that I have handled the teaching of assumptions adequately – so I’m not at all blameless in all of this mess. (See two recently blogs on Normality and Homogeneity also.)

I’d really like to do a study looking at more experienced researcher’s basic understanding of assumptions (a follow up to Hoekstra’s study on a more experienced sample, and with more probing questions) just to see whether my suspicions are correct. Maybe I should email Hoekstra and see if they’re interested because, left to my own devices, I’ll probably never get around to it.

Anyway, my point is that I think it’s not just deliberate fraud that creates false knowledge, there is also a problem of well-intentioned and honest folk simply not understanding what to do, or when to do it.

3.  Convince yourself that a significant effect at p=.055 is real. How many times have psychologists tested a prediction only to find that the critical comparison just misses the crucial p=.05 value? How many times have psychologists then had another look at the data to see if it might just be possible that with a few outliers removed this predicted effect might be significant? Strangely enough, many published psychology papers are just creeping past the p=.05 value – and many more than would be expected by chance! Just how many false psychology facts has that created? (

This is a massive over-simplification because an effect of p = .055 is ‘real’ and might very well be ‘meaningfiul’. Conversely, an effect with p < .001 might very well be meaningless. To my mind it probably matters very little if a p is .055 or .049. I’m not suggesting I approve of massaging your data, but really this point illustrates how wedded psychologists are to the idea of effects somehow magically becoming ‘real’ or ‘meaningful’ once p drifts below .05. There’s a few points to make here:

First, all effects are ‘real’. There should never be a decision being made by anyone about whether an effect is real or not real. They’re all real. It’s just that some are large and some are small. There is a decision about whether an effect is meaningful, and that decision should be made within the context of the research question.

Second, I think an equally valid way to create ‘false knowledge’ is to publish studies based on huge samples reporting loads of small and meaningless effects that are highly significant. Imagine you look at the relationship between statistical knowledge and eating curry. You test 1 million people and find that there is a highly significant negative relationship, r = -.002, p < .001. You conclude that eating curry is a 'real' effect – it is meaningfully related to poorer statistical knowledge. There are two issues here: (1) in a sample of 1 million people the effect size estimate will be very precise, and the confidence interval very narrow. So we know the true effect in the population is going to be very close indeed to -.002. In other words, there is basically no effect in the population – eating curry and statistical knowledge have such a weak relationship that you may as well forget about it. (2) anyone trying to replicate this effect in a sample substantially smaller than 1 million is highly unlikely to get a significant result. You've basically published an effect that is 'real' if you use p < .05 to define your reality, but is utterly meaningless and won't replicate (in terms of p) in small samples.

Third, there is a wider problem than people massage their ps. You have to ask why people massage their ps. The answer to that is because psychology is so hung up on p-values. Over 10 years since the APA published their report on statistical reporting (Wilkinson, 1999) there has been no change in the practice of applying the all-or-nothing thinking of accepting results as ‘real’ if p < .05. It's true that Wilkinson's report has had a massive impact in the frequency with which effect sizes and confidence intervals are reported, but (in my experience which is perhaps not representative) these effect sizes are rarely interpreted with any substance and it is still the p-value that drives decisions made by reviewers and editors.

This whole problem would go away if ‘meaning’ and ‘substance’ of effects were treated not as a dichotomous decision, but as a point along a continuum. You quantify your effect, you construct a confidence interval around it, and you interpret it within the context of the precision that your sample size allows. This way, studies with large samples could no longer focus on meaningless but significant effects, instead the researcher could say (given  the high level of precision they have) that the effects in the population (the true effects if you like) are likely to be about the size that they observed and interpret accordingly. In small studies, rather than throwing out the baby with the bathwater, large effects could be given some creditability but with the caveat that the estimates in the study lack precision. This is where replication is useful. No need to massage data – researchers just give it to the reader as it is, interpret it and apply the appropriate caveats etc. One consequence of this might be that rather than publishing a single small study with massaged data to get p < .05, researchers might be encouraged to replicate their own study a few times and report them all in a more substantial paper. Doing so would mean that across a few studies you could show (regardless of p) the likely size of the effect in the population.

That turned into a bigger rant than I was intending ….


Haller, H., & Kraus, S. (2002). Misinterpretations of Significance: A Problem Students Share with Their Teachers? MPR-Online, 7(1), 1-20. 
Wilkinson, L. (1999). Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist, 54(8), 594-604.