## Newspapers and 7 Core Statistical Concepts

Anyway, all of this reminded me that when I’m trying to convince my students that statistics is a good thing to learn about, my main point is that it is a transferable skill that helps you to navigate the tricky terrain of life. After three years of a psychology degree (or any other degree that teaches applied statistics), you’re in the rather privileged position of being able to evaluate evidence for yourself. You don’t have to take it on trust when the newspaper, or your GP, tells you not to vaccinate your child because the injection will grow them a second head; you can track down the research and evaluate the evidence for yourself.
To quote Utts, “What good is it to know how to carry out a t-test if a student cannot read a newspaper article and determine that hypothesis testing has been misused?” (1, p. 78). Utts (1) suggests seven core statistical ideas that could be described as ‘useful life skills’, which I have summarized as follows (2):
1. When causal relationships can and cannot be inferred, including the difference between observational studies and randomized experiments;
2. The difference between statistical significance and practical importance, especially when using large sample sizes;
3. The difference between finding ‘no effect’ and finding no statistically significant effect, especially when sample sizes are small;
4. Sources of bias in surveys and experiments, such as poor wording of questions, volunteer response, and socially desirable answers;
5. The idea that coincidences and seemingly very improbable events are not uncommon because there are so many possibilities (to use a classic example, although most people would consider it an unbelievable coincidence/unlikely event to find two people in a group of 30 that share the same birthday, the probability is actually .7, which is fairly high);
6. ‘Confusion of the inverse’ in which a conditional probability in one direction is confused with the conditional probability in the other direction (for example, the prosecutor’s fallacy) ;
7. Understanding that variability is natural, and that ‘normal’ is not the same as ‘average’ (for example, the average male height in the UK is 175cm; although a man of 190cm is, therefore, well above average, his height is within the normal range of male heights).
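The figure in idea 5 isn’t something you have to take on trust: the exact probability drops out of a few lines of code. A quick sketch (in Python here, though one line of R would do just as well):

```python
from math import prod

def birthday_collision_prob(n, days=365):
    """Exact probability that at least two of n people share a birthday:
    1 minus the probability that all n birthdays are distinct."""
    p_all_distinct = prod((days - i) / days for i in range(n))
    return 1 - p_all_distinct

print(round(birthday_collision_prob(30), 3))  # → 0.706
print(round(birthday_collision_prob(23), 3))  # → 0.507
```

The probability passes .5 at only 23 people, which is exactly why the result feels like an unbelievable coincidence when it isn’t.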

In a book chapter I wrote on teaching statistics in higher education (2), I suggest that we should try, if nothing else, to get students to leave their degree programs with these core skills. We could also think about using real world examples (not necessarily from within our own discipline) to teach students how to apply these skills. This could have several benefits: (1) it might make the class more interesting; (2) it helps students to apply knowledge beyond the realm of their major subject; and (3) it will undermine the power that newspapers and the media in general have to sensationalize research findings, spread misinformation, and encourage lazy thinking. So, my main point is that, as teachers, we could think about these things when teaching, and students might take comfort in the fact that the stats classes they endured might have given them a useful shield to fend off the haddock of misinformation with which the media slaps their faces every day.
Right, I’m off to restructure my statistics course around those 7 key ideas ….

### References

1. Utts J. What Educated Citizens Should Know About Statistics and Probability. The American Statistician. 2003;57(2):74-79.
2. Field AP. Teaching Statistics. In: Upton D, Trapp A, editors. Teaching Psychology in Higher Education. Chichester, UK: Wiley-Blackwell; 2010.

## Bias in End of Year Album Polls?

So, in rolls 2012 and out rolls another year. I like new year: it’s a time to fill yourself with optimism about the exciting things that you will do. Will this be the year that I write something more interesting than a statistics book, for example? It’s also a time of year to reflect upon all of the things that you thought you’d do last year but didn’t. That’s a bit depressing, but luckily 2011 was yesterday and today is a new year and a new set of hopes that have yet to fail to come to fruition.
It’s also the time of year that magazines publish their end-of-year polls. Two magazines that I regularly read are Metal Hammer and Classic Rock (because, in case it isn’t obvious from my metal-themed website and podcasts, I’m pretty fond of heavy metal and classic rock). The ‘album of the year’ polls in these magazines are an end of year treat for me: it’s an opportunity to see what highly rated albums I overlooked, to wonder at how an album that I hate has turned up in the top 5, or to pull a bemused expression at how my favourite album hasn’t made it into the top 20. At my age, it’s good to get annoyed about pointless things.
Anyway, for years I have had the feeling that these end of year polls are biased. I don’t mean in any nefarious way, but simply that reviewers who contribute to these polls tend to rate recently released albums more highly than ones released earlier in the year. Primacy and recency effects are well established in cognitive psychology: if you ask people to remember a list of things, they find it easier to recall items at the start or end of the list. Music journalists are (mostly) human so it’s only reasonable that reviewers will succumb to these effects, isn’t it?
I decided to actually take some time off this winter solstice, and what happens when you take time off? In my case, you get bored and start to wonder whether you can test your theory that end of year polls are biased. The next thing you know, you’re creating a spreadsheet with Metal Hammer and Classic Rock’s end of year polls in it, then you’re on Wikipedia looking up other useful information about these albums, and then, when you should be watching the annual re-run of Mary Poppins, you find that you’re getting R to do a nonparametric bootstrap of a mediation analysis. The festive period has never been so much fun.
Anyway, I took the lists of top 20 albums from both Metal Hammer and Classic Rock magazine. I noted down each album’s position in the poll (1 = best album of the year, 20 = 20th best album of the year), and I also found out which month each album was released. From this information I could calculate how many months before the poll the album came out (0 = the album came out the same month as the poll, 12 = the album came out 12 months before the poll). I called this variable Time.since.release.
My theory implies that an album’s position in the end of year list (position) will be predicted from how long before the poll the album was released. A recency effect would mean that albums released close to the end of the year (i.e., a low score on Time.since.release) will be higher up the end of year poll (remember that the lower the score, the higher up the poll the album is). So, we predict a positive relationship between position and Time.since.release.
Let’s look at the scatterplot:
Both magazines show a positive relationship: albums higher up the poll (i.e. low score on position) tend to have been released more recently (i.e., low score on the number of months ago that the album was released). This effect is much more pronounced though for Metal Hammer than for Classic Rock.
To quantify the relationship between position in the poll and time since the album was released we can look at a simple correlation coefficient. Our position data are a rank, not interval/ratio, so it makes sense to use Spearman’s rho, and we have a fairly small sample so it makes sense to bootstrap the confidence interval. For Metal Hammer we get (note that because of the bootstrapping you’ll get different results if you try to replicate this) rho = .428 (bias = –0.02, SE = 0.19) with a 95% bias corrected and accelerated confidence interval that does not cross zero (BCa 95% CI = .0092, .7466). The confidence interval is pretty wide, but tells us that the correlation in the population is unlikely to be zero (in other words, the true relationship between position in the poll and time since release is likely to be more than no relationship at all). Also, because rho is an effect size we can interpret its size directly, and .428 is a fairly large effect. In other words, Metal Hammer reviewers tend to rank recent albums higher than albums released a long time before the poll. They show a relatively large recency effect.
What about Classic Rock? rho = .038 (bias = –0.001, SE = 0.236), with a BCa 95% CI = –.3921, .5129. The confidence interval is again wide, but this time crosses zero (in fact, zero is more or less in the middle of it). This CI tells us that the true relationship between position in the poll and time since release could be zero, in other words, no relationship at all. We can again interpret rho directly, and .038 is very small (it’s close to zero). In other words, Classic Rock reviewers do not tend to rank recent albums higher than albums released a long time before the poll. They show virtually no recency effect. This difference is interesting (especially given that there is overlap between the contributors to the two magazines!).
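If you want to replicate this kind of analysis without my spreadsheet, here is a minimal sketch of a bootstrapped Spearman correlation. I did the real analysis in R with BCa intervals; the sketch below is Python, uses the simpler percentile bootstrap, and runs on made-up poll data, so its numbers are purely illustrative:

```python
import math
import random

def ranks(v):
    """Ranks of the values in v, with ties given their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for rho (the magazine analysis used BCa)."""
    rng = random.Random(seed)
    n = len(x)
    rhos = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rhos.append(spearman_rho([x[i] for i in idx], [y[i] for i in idx]))
    rhos.sort()
    return rhos[int(alpha / 2 * n_boot)], rhos[int((1 - alpha / 2) * n_boot) - 1]

# Made-up data: poll position 1-20 and months between release and the poll.
position = list(range(1, 21))
months = [1, 3, 0, 2, 5, 1, 4, 6, 2, 8, 3, 9, 5, 11, 7, 4, 10, 6, 12, 9]
print(spearman_rho(position, months))
print(bootstrap_ci(position, months))
```

If the interval excludes zero you have the Metal Hammer pattern; if zero sits comfortably inside it you have the Classic Rock pattern.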
It then occurred to me, because I spend far too much time thinking about this sort of thing, that perhaps it’s simply the case that better albums come out later in the year and this explains why Metal Hammer reviewers rate them higher. ‘Better’ is far too subjective a variable to quantify; however, it might be reasonable to assume that bands that have been around for a long time will produce good material (not always true, as Metallica’s recent turd floating in the loo-loo demonstrates). Indeed, Metal Hammer’s top 10 contained Megadeth, Anthrax, Opeth, and Machine Head: all bands who have been around for 15-20 years or more. So, in the interests of fairness to Metal Hammer reviewers, let’s see whether ‘experience’ mediates the relationship between position in the poll and time since the album’s release. Another 30 minutes on the internet and I had collated the number of studio albums produced by each band in the top 20. The number of studio albums seems like a reasonable proxy for experience (and better than years as a band, because some bands produce an album every 8 years and others every 2). So, I did a mediation analysis with a nonparametric bootstrap (thanks to the mediation package in R). The indirect effect was 0.069 with a 95% CI = –0.275, 0.537. The proportion of the effect explained by mediation was about 1%. In other words, the recency bias in the Metal Hammer end of year poll could not be explained by the number of albums that bands in the poll had produced in the past (i.e. experience). Basically, despite my best efforts to give them the benefit of the doubt, Metal Hammer critics are actually biased towards giving high ratings to more recently released albums.
These results might imply many things:
• Classic Rock reviewers are more objective when creating their end of year polls (because they override the natural tendency to remember more recent things, like albums).
• Classic Rock reviewers are not human because they don’t show the recency effects that you expect to find in normal human beings. (An interesting possibility, but we need more data to test it …)
• Metal Hammer should have a ‘let’s listen to albums released before June’ party each November to remind their critics of albums released earlier in the year. (Can I come please and have some free beer?)
• Metal Hammer should inversely weight reviewers’ ranks by the time since release so that albums released earlier in the year get weighted more heavily than recently released albums. (Obviously, I’m offering my services here …)
• I should get a life.

Ok, that’s it. I’m sure this is of no interest to anyone other than me, but it does at least show how you can use statistics to answer pointless questions. A bit like what most of us scientists do for a large portion of our careers. Oh, and if I don’t get a ‘Spirit of the Hammer’ award for 15 years’ worth of infecting students of statistics with my heavy metal musings then there is no justice in the world. British Psychological Society awards and National Teaching Fellowships are all very well, but I need a spirit of the hammer on my CV (or at least Defender of the Faith).

Have a rockin’ 2012
andy
P.P.S. My top 11 (just to be different) albums of 2011 (the exact order is a bit rushed):
1. Opeth: Heritage
2. Wolves in the Throne Room: Celestial Lineage
3. Anthrax: Worship Music
4. Von Hertzen Brothers: Stars Aligned
5. Liturgy: Split LP (although the Oval side is shit)
6. Ancient Ascendent: The Grim Awakening
7. Graveyard: Hisingen Blues
8. Foo Fighters: Wasting Light
9. Status Quo: Quid Pro Quo
10. Manowar: Battle Hymns MMXI
11. Mastodon: The Hunter

## Should I Buy This Book?

I’m thinking of buying this book.

On the face of it, it seems like the kind of book I’ll enjoy. Admittedly I’d enjoy a nice biography about Metallica or some other heavy rock band more, but I need to maintain the façade of loving statistics. It seems to be an historical account of why we use significance testing and why it’s a terrible idea. I’m fairly confident that I already know most of what it will say, but the synopsis promises some nice discipline-specific examples of the ‘train wreck’ that is hypothesis testing. I probably won’t know these and it seems like there’s potential for entertainment. However, the two reviews of this book (which are fairly positive) say the following:

Reviewer 1: “In statistics, a result is called statistically significant if it is unlikely to have occurred by chance.”

Reviewer 2: “A relationship between two variables is statistically significant if there is a low probability (usually less than five per cent) of it happening by chance.”

Both of which are wrong. A result is statistically significant if the observed effect/relationship/whatever is unlikely to have occurred GIVEN THAT THERE IS NO EFFECT/RELATIONSHIP/WHATEVER IN THE POPULATION.
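The distinction matters because the probability is computed assuming the null hypothesis is true. One way to see this is to simulate experiments in which there genuinely is no effect: results that are ‘unlikely to have occurred by chance’ still turn up at exactly the rate the significance threshold dictates. A quick sketch (Python, a normal approximation to the t-test, invented data):

```python
import math
import random

def two_sample_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation,
    which is fine for the sample sizes used here)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

rng = random.Random(42)
n_sims = 2000
hits = 0
for _ in range(n_sims):
    a = [rng.gauss(0, 1) for _ in range(50)]  # the null is TRUE here:
    b = [rng.gauss(0, 1) for _ in range(50)]  # both groups are identical
    if two_sample_p(a, b) < 0.05:
        hits += 1
print(hits / n_sims)  # hovers around 0.05, by construction
```

Significance never tells you the probability that chance produced your result; it tells you how often results this extreme occur when chance is all there is.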

So, although I probably will buy this book because it looks interesting, I offer up my own free version here in this blog. I know it’s insanely generous of me to give you a whole book for free, but I’m a caring kind of guy. So here it is:

The Cult of Statistical Significance and Why it Will Fry Your Brain
By
Andy P. Field

If people can read an entire book about a concept and still not understand what it is, then that concept is probably unnecessarily confusing, poorly conceived and should be buried in a lead box, chained with particularly curmudgeonly rattlesnakes, guarded by rabid hounds, and placed in Satan’s toilet bowl. Only someone with the guile of Indiana Jones should be able to retrieve it, and should they ever manage to, this person should be dipped in sugar and set upon by locusts. It turns out that significance testing is such a concept (although with N = 2 I probably didn’t have enough power to significance test my hypothesis).

The End

## The Joy of Confidence Intervals

In my last blog I mentioned that Null Hypothesis Significance Testing (NHST) is a bad idea (despite most of us having been taught it, using it, and possibly teaching it to future generations). I also said that confidence intervals are poorly understood. Coincidentally, a colleague of mine, knowing that I was of the ‘burn NHST at the stake’ brigade, recommended this book by Geoff Cumming. It turns out that within the first 5 pages it gives the most beautiful example of why confidence intervals tell us more than NHST. I’m going to steal Geoff’s argument blatantly, but with the proviso that anyone reading this blog buys his book, preferably two copies.
OK, imagine you’ve read Chapter 8 of my SPSS/SAS or R book in which I suggest that rather than cast rash judgments on a man for placing an eel up his anus to cure constipation, we use science to evaluate the efficacy of the man’s preferred intervention. You randomly allocate people with constipation to a treatment as usual group (TAU) or to placing an eel up their anus (intervention). You then find a good lawyer.
Imagine there were 10 studies (you can assume they are of a suitably high quality with no systematic differences between them) that had reported such scientific endeavors. They all have a measure of constipation as their outcome (let’s assume it’s a continuous measure). A positive difference between means indicates that the intervention was better than the control group at reducing constipation.
Here are the results:
| Study | Difference between means | t | p |
| --- | --- | --- | --- |
| Study 1 | 4.193 | 3.229 | 0.002* |
| Study 2 | 2.082 | 1.743 | 0.086 |
| Study 3 | 1.546 | 1.336 | 0.187 |
| Study 4 | 1.509 | 0.890 | 0.384 |
| Study 5 | 3.991 | 2.894 | 0.006* |
| Study 6 | 4.141 | 3.551 | 0.001* |
| Study 7 | 4.323 | 3.745 | 0.000* |
| Study 8 | 2.035 | 1.479 | 0.155 |
| Study 9 | 6.246 | 4.889 | 0.000* |
| Study 10 | 0.863 | 0.565 | 0.577 |
OK, here’s a quiz. Which of these statements best reflects your interpretation of these data?
•  A. The evidence is equivocal, we need more research.
•  B. All of the mean differences show a positive effect of the intervention, therefore, we have consistent evidence that the treatment works.
•  C. Five of the studies show a significant result (p < .05), but the other 5 do not. Therefore, the studies are inconclusive: some suggest that the intervention is better than TAU, but others suggest there's no difference. The fact that half of the studies showed no significant effect means that the treatment is not (on balance) more successful in reducing symptoms than the control.
•  D. I want to go for C, but I have a feeling it’s a trick question.

Some of you, or at least those of you brought up to worship at the shrine of NHST, probably went for C. If you didn’t then good for you. If you did, then don’t feel bad, because if you believe in NHST then that’s exactly the answer you should give.
Now let’s look at the 95% confidence intervals for the mean differences in each study:
Note the mean differences correspond to those we have already seen (I haven’t been cunning and changed the data). Thinking about what confidence intervals show us, which of the statements A to D above best fits your view?
Hopefully, many of you who thought C before now think B. If you still think C, then I will explain why you should go for B:
A 95% confidence interval is constructed so that, if you repeated the study over and over, 95 out of 100 such intervals would contain the population value (and 5 out of 100 would miss it). In other words, confidence intervals from most studies will capture the likely true population value. Looking at our 10 studies, only 3 of the 10 contain zero (studies 3, 8 and 10), and for two of them (studies 3 and 10) zero only just scrapes in. Therefore, in 7 of the 10 studies the evidence suggests that the population difference between group means is NOT zero. In other words, there is an effect in the population (zero would mean no difference between the groups). So, 7 out of 10 studies suggest that the population value, the actual real difference between groups, is NOT ZERO. What’s more, even the 3 that do contain zero show a positive difference, and only a relatively small portion of the tail of the CI falls below zero. So, even in the three studies whose confidence intervals cross zero, it is more likely than not that the population value is greater than zero. As such, across all 10 studies there is strong and consistent evidence that the population difference between means is greater than zero, reflecting a positive effect of the intervention compared to the TAU.
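The ‘95 out of 100 intervals contain the population value’ claim is a statement about the procedure, and you can check it by simulation. A sketch in Python with an invented population (a true mean difference of 4, roughly in the range of our eel studies):

```python
import math
import random

rng = random.Random(2012)
mu, sigma, n = 4.0, 2.0, 30    # invented population and per-study sample size
trials, covered = 1000, 0
for _ in range(trials):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    half_width = 2.045 * sd / math.sqrt(n)   # t critical value for df = 29
    if mean - half_width <= mu <= mean + half_width:
        covered += 1
print(covered / trials)  # close to 0.95
```

Run this and roughly 95% of the simulated intervals trap the true value, which is all a confidence level ever promised.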
The main point that Cumming makes (he talks about meta-analysis too, but I’m bored of typing now) is that the dichotomous significant/non-significant thinking fostered by NHST can lead you to radically different conclusions from those you would reach if you simply looked at the data with a nice, informative confidence interval. In short, confidence intervals rule, and NHST sucks.
More important, it should not be the case that the way we picture the data/results completely alters our conclusions. Given we’re stuck with NHST at least for now, we could do worse than use CIs as the necessary pinch of salt required when interpreting significance tests.
Hopefully, that explains some of the comments in my previous blog. I’m off to buy a second copy of Geoff’s book …

## Top 5 Statistical Faux Pas

In a recent article (Nieuwenhuis et al., 2011, Nature Neuroscience, 14, 1105-1107), neuroscientists were shown to be statistically retarded … or something like that. Ben Goldacre wrote an article about this in the Guardian newspaper, which caused a bit of a kerfuffle amongst British psychologists because in the first published version he accidentally lumped psychologists in with neuroscientists. Us psychologists, being the sensitive souls that we are, decided that we didn’t like being called statistically retarded; we endure a lot of statistics classes during our undergraduate and postgraduate degrees, and if we learnt nothing in them then the unbelievable mental anguish will have been for nothing.
Neuroscientists may have felt much the same, but unfortunately for them Nieuwenhuis, at the request of the British Psychological Society’s publication The Psychologist, confirmed that the sample of papers he reviewed contained no papers by psychologists. The deafening sonic eruption of people around the UK not giving a shit could be heard in Fiji.
The main finding from the Nieuwenhuis paper was that neuroscientists often make the error of thinking that a non-significant difference is different from a significant one. Hang on, that’s confusing. Let’s say group A’s anxiety levels change significantly over time (p = .049) and group B’s do not (p = .060), then neuroscientists tend to assume that the change in anxiety in group A is different to that in group B, whereas the average psychologist would know that you need to test whether the change in group A differs from the change in group B (i.e., look for a significant interaction).
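To see how misleading the ‘A is significant, B isn’t’ logic is, simulate two groups whose anxiety changes are drawn from exactly the same population, and count how often the two separate tests disagree about significance. A hedged sketch (Python, normal approximations, invented effect sizes):

```python
import math
import random

def p_from_z(z):
    """Two-sided p-value from a z statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def one_sample_p(d):
    """p-value for 'mean change differs from zero' (normal approximation)."""
    n, m = len(d), sum(d) / len(d)
    sd = math.sqrt(sum((x - m) ** 2 for x in d) / (n - 1))
    return p_from_z(m / (sd / math.sqrt(n)))

def two_sample_p(a, b):
    """The test that actually matters: do the changes differ between groups?"""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return p_from_z((ma - mb) / math.sqrt(va / na + vb / nb))

rng = random.Random(1)
sims, disagree, direct_sig = 1000, 0, 0
for _ in range(sims):
    # Both groups change by the SAME true amount (0.35 SDs):
    a = [rng.gauss(0.35, 1) for _ in range(25)]
    b = [rng.gauss(0.35, 1) for _ in range(25)]
    pa, pb = one_sample_p(a), one_sample_p(b)
    if min(pa, pb) < 0.05 <= max(pa, pb):
        disagree += 1          # the 'A significant, B not' pattern
    if two_sample_p(a, b) < 0.05:
        direct_sig += 1        # the correct group comparison
print(disagree / sims)    # roughly half the time!
print(direct_sig / sims)  # around 0.05, as it should be with no true difference
```

Even with identical true changes, the two separate significance tests ‘disagree’ about half the time, while the direct comparison correctly stays significant only about 5% of the time; which is why you must test the difference (the interaction), not compare p-values.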
My friend Thom Baguley wrote a nice blog about it. He asked whether psychologists were entitled to feel smug about not making the Nieuwenhuis error, and politely pointed out some errors that we do tend to make. His blog inspired me to write my top 5 common mistakes, which should remind scientists of every variety that we probably shouldn’t meddle with things that we don’t understand: statistics, for example.

### 5. Median splits

OK, I’m starting by cheating because this one is in Thom’s blog too, but scientists (psychologists especially) love nothing more than butchering perfectly good continuous variables with the rusty meat cleaver that is the median (or some other arbitrary blunt instrument). Imagine 4 children aged 2, 8, 9, and 16. You do a median split to compare ‘young’ (younger than 8.5) and ‘old’ (older than 8.5). What you’re saying here is that a 2-year-old is identical to an 8-year-old, a 9-year-old is identical to a 16-year-old, and an 8-year-old is completely different in every way to a 9-year-old. If that doesn’t convince you that it’s a curious practice then read DeCoster, Gallucci, & Iselin (2011) or MacCallum, Zhang, Preacher, & Rucker (2002).
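The four-children example takes two lines to make concrete (Python):

```python
ages = [2, 8, 9, 16]
groups = ['young' if age < 8.5 else 'old' for age in ages]  # split at the median
print(list(zip(ages, groups)))
# → [(2, 'young'), (8, 'young'), (9, 'old'), (16, 'old')]
```

After the split, the 6-year gap between the 2- and 8-year-old vanishes, while the 1-year gap between the 8- and 9-year-old becomes the only distinction the analysis can see.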

### 4. Confidence intervals

Using confidence intervals is a good idea – the APA statistics task force say so – except that no-one understands them. Well, behavioural neuroscientists, medics and psychologists don’t (Belia, Fidler, Williams, & Cumming, 2005) (see a nice summary of the Belia paper here). I think many scientists would struggle to say correctly what a CI represents, and many textbooks (including the first edition of my own Discovering Statistics Using SPSS) give completely incorrect, but commonly reproduced, explanations of what a CI means.

### 3. Assuming normally distributed data

I haven’t done it, but I reckon that if you asked the average scientist what the assumptions of tests based on the normal distribution were, most would say that you need normally distributed data. You don’t. You typically need a normally distributed sampling distribution or normally distributed residuals/errors. The beauty of the central limit theorem is that in large samples the sampling distribution will be normal anyway, so your sample data can be shaped exactly like a blue whale giving a large African elephant a piggyback and it won’t make a blind bit of difference.
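If you don’t believe the blue whale bit, simulate it: draw raw scores from a brutally skewed distribution and watch the sampling distribution of the mean behave itself. A sketch (Python; an exponential population chosen purely for its skew):

```python
import random

rng = random.Random(99)

def skewness(v):
    """Standardized third moment: 0 for a symmetric distribution."""
    n = len(v)
    m = sum(v) / n
    s2 = sum((x - m) ** 2 for x in v) / n
    s3 = sum((x - m) ** 3 for x in v) / n
    return s3 / s2 ** 1.5

raw = [rng.expovariate(1.0) for _ in range(5000)]
means = [sum(rng.expovariate(1.0) for _ in range(50)) / 50 for _ in range(5000)]
print(round(skewness(raw), 2))    # near 2: the raw scores are heavily skewed
print(round(skewness(means), 2))  # much nearer 0: means of n = 50 look normal
```

The raw scores are nowhere near normal, yet the distribution of sample means is already close to it; which is the central limit theorem doing its job.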

### 2. Homogeneity of variance matters

Despite people like me teaching the next generation of scientists all about how homogeneity of variance/homoscedasticity should be carefully checked, the reality is that we should probably just do robust tests or use a bootstrap anyway and free ourselves from the Iron Maiden of assumptions that perforates our innards on a daily basis. Also, in regression, heteroscedasticity doesn’t really affect anything important (according to Gelman & Hill, 2007).

### 1. Hypothesis testing

In at number 1 as the top statistical faux pas is null hypothesis significance testing (NHST). With the honorable exceptions of physicists and a few others from the harder sciences, most scientists use NHST. Much has been written on why this practice is a bad idea (e.g., Meehl, 1978). To sum up: (1) it stems from a sort of hideous experiment in which two quite different statistical philosophies were placed together on a work bench and joined using a staple gun; (2) a p-value is the probability of something given that something that is never true is true, which of course it isn’t, which means that you can’t really get anything useful from a p-value other than a publication in a journal; (3) it results in the kind of ridiculous situations in which people completely reject ideas because their p was .06, but lovingly embrace and copulate with other ideas because their p value was .049; (4) ps depend on sample size, and consequently you find researchers who have just studied 1000 participants joyfully rubbing their crotch at a pitifully small and unsubstantive effect that, because of their large sample, has crept below the magical fast-track to publication that is p < .05; (5) no-one understands what a p-value is, not even research professors or people teaching statistics (Haller & Kraus, 2002). Physicists must literally shit their pants with laughter at this kind of behaviour.
Surely, the interaction oversight (or the ‘missing in interaction’ you might say) faux pas of the neuroscientists is the least of their (and our) worries.

### References

• Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10, 389-396.
• DeCoster, J., Gallucci, M., & Iselin, A.-M. R. (2011). Best practices for using median splits, artificial categorization, and their continuous alternatives. Journal of Experimental Psychopathology, 2(2), 197-209.
• Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.
• Haller, H., & Kraus, S. (2002). Misinterpretations of Significance: A Problem Students Share with Their Teachers? MPR-Online, 7(1), 1-20.
• MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7(1), 19-40.
• Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.

## Meta-Analysis and SEM

Someone recently asked me about how to incorporate results from Structural Equation Models into a meta-analysis. The short answer is ‘with great difficulty’, although that’s not terribly helpful.
One approach is to do the meta-analysis at the correlation level, which is good if the SEM studies report the zero-order correlations (which hopefully they do). That is, you use meta-analysis to estimate pooled values of the correlations between variables. Imagine you had three variables (anxiety, parenting, and age) and a bunch of papers that report relationships between some of these variables. Let’s say you have 53 papers in total, and 52 report the correlation between age and anxiety; you use these 52 studies to get a pooled value of r for that relationship. If only 9 of the studies report the relationship between parenting and anxiety, then you use only these 9 to get a pooled r for that association, and so on. You stick these pooled values into a correlation matrix and then fit an SEM to it to test whatever model you want to test. So, the meta-analysis part of the analysis informs the correlation matrix on which the resulting SEM is based: it’s running an SEM on data that others have collected (and you have pooled together). This is called meta-analytic structural equation modeling (MASEM; Cheung & Chan, 2005, 2009). You can implement it with Mike Cheung’s metaSEM package in R; Mike Cheung has some great resources here.
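The pooling stage is simple enough to sketch. A common fixed-effect approach averages each correlation on Fisher’s z scale, weighting each study by n - 3; proper MASEM (e.g., the metaSEM package) does this jointly across the whole matrix and carries the uncertainty into the SEM stage, so treat this Python fragment with invented numbers as the general idea only:

```python
import math

def pool_correlations(rs, ns):
    """Fixed-effect pooled correlation: average on Fisher's z scale,
    each study weighted by n - 3 (the inverse of z's variance)."""
    zs = [math.atanh(r) for r in rs]       # Fisher's z transform
    ws = [n - 3 for n in ns]
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    return math.tanh(z_bar)                # back-transform to r

# Invented example: four studies reporting the age-anxiety correlation.
pooled_r = pool_correlations([0.30, 0.25, 0.42, 0.18], [120, 80, 45, 200])
print(round(pooled_r, 3))
```

Repeat this for every cell of the matrix that at least one study reports, assemble the pooled correlation matrix, and fit the SEM to that.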

### References

• Cheung, M. W. L., & Chan, W. (2005). Meta-analytic structural equation modeling: A two-stage approach. Psychological Methods, 10, 40-64.
• Cheung, M. W. L., & Chan, W. (2009). A two-stage approach to synthesizing covariance matrices in meta-analytic structural equation modeling. Structural Equation Modeling, 16, 28-53.