FAQ #1: K-S Tests in SPSS

I decided to start a series of blogs on questions that I get asked a lot. When I say a series I’m probably raising expectation unfairly: anyone who follows this blog will realise that I’m completely crap at writing blogs. Life gets busy. Sometimes I need to sleep. But only sometimes.

Anyway, I do get asked a lot why there are two ways to do the Kolmogorov-Smirnov (K-S) test in SPSS. In fact, I got an email only this morning. I knew I’d answered this question many times before, but I couldn’t remember where I might have saved a response, so I figured that if I just blogged about it then at least I’d know where to find the answer next time. So, here it is. Notwithstanding my reservations about using the K-S test at all (you’ll have to wait until edition 4 of the SPSS book), there are three ways to get one from SPSS:

  1. Analyze > Explore > Plots > Normality plots with tests
  2. Nonparametric Tests > One Sample… (or Legacy Dialogs > 1-Sample K-S)
  3. Tickle SPSS under the chin and whisper sweet nothings into its ear
These methods give different results. Why is that? Essentially (I think) if you use method 1 then SPSS applies Lilliefors’ correction, but if you use method 2 it doesn’t. If you use method 3 then you just look like a weirdo.
So, is it better to use Lilliefors’ correction or not? In the additional website material for my SPSS book, which no-one ever reads (the web material, not the book …), I wrote (self-plagiarism alert):
“If you want to test whether a model is a good fit of your data you can use a goodness-of-fit test (you can read about these in the chapter on categorical data analysis in the book), which has a chi-square test statistic (with the associated distribution). One problem with this test is that it needs a certain sample size to be accurate. The K–S test was developed as a test of whether a distribution of scores matches a hypothesized distribution (Massey, 1951). One good thing about the test is that the distribution of the K–S test statistic does not depend on the hypothesized distribution (in other words, the hypothesized distribution doesn’t have to be a particular distribution). It is also what is known as an exact test, which means that it can be used on small samples. It also appears to have more power to detect deviations from the hypothesized distribution than the chi-square test (Lilliefors, 1967). However, one major limitation of the K–S test is that if location (i.e. the mean) and scale parameters (i.e. the standard deviation) are estimated from the data then the K–S test is very conservative, which means it fails to detect deviations from the distribution of interest (i.e. normal). What Lilliefors did was to adjust the critical values for significance for the K–S test to make it less conservative (Lilliefors, 1967) using Monte Carlo simulations (these new values were about two thirds the size of the standard values). He also reported that this test was more powerful than a standard chi-square test (and obviously the standard K–S test).
Another test you’ll use to test normality is the Shapiro-Wilk test (Shapiro & Wilk, 1965) which was developed specifically to test whether a distribution is normal (whereas the K–S test can be used to test against other distributions than normal). They concluded that their test was ‘comparatively quite sensitive to a wide range of non-normality, even with samples as small as n = 20. It seems to be especially sensitive to asymmetry, long-tailedness and to some degree to short-tailedness.’ (p. 608). To test the power of these tests they applied them to several samples (n = 20) from various non-normal distributions. In each case they took 500 samples which allowed them to see how many times (in 500) the test correctly identified a deviation from normality (this is the power of the test). They show in these simulations (see table 7 in their paper) that the S-W test is considerably more powerful to detect deviations from normality than the K–S test. They verified this general conclusion in a much more extensive set of simulations as well (Shapiro, Wilk, & Chen, 1968).” 
So there you go. More people have probably read that now than when it was on the additional materials for the book. It looks like Lilliefors’ correction is a good thing (power-wise), but you probably don’t want to be using K-S tests anyway; if you do, interpret them within the context of the size of your sample and look at graphical displays of your scores too.
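
If you’d rather see the difference between the two routes outside of SPSS’s menus, here is a rough sketch in R (the data are simulated, and I’m assuming the nortest package for the Lilliefors-corrected version); it illustrates the two tests rather than reproducing SPSS’s output exactly.

```r
library(nortest)  # for lillie.test(); assumed to be installed

set.seed(42)
x <- rnorm(50, mean = 100, sd = 15)  # some example scores

# Roughly what the nonparametric/legacy dialog does: a K-S test against a
# normal distribution with the mean and SD estimated from the data, but
# with NO Lilliefors correction (so it tends to be conservative).
ks.test(x, "pnorm", mean(x), sd(x))

# Roughly what Explore > Normality plots with tests reports: the K-S test
# with Lilliefors' corrected critical values.
lillie.test(x)

# The Shapiro-Wilk test, which tends to have more power to detect
# deviations from normality.
shapiro.test(x)
```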

You Can’t Trust Your PhD Supervisor :)

My ex-PhD supervisor Graham Davey posted a blog this morning about 10 ways to create false knowledge in psychology. It’s a tongue-in-cheek look at various things that academics do for various reasons. Two of his 10 points have a statistical theme, and they raise some issues in my mind. I could walk the 3 meters between my office and Graham’s to discuss these, or rib him gently about it next time I see him in the pub, but I thought it would be much more entertaining to write my own blog about it. A blog about a blog, if you will. Perhaps he’ll reply with a blog about a blog about a blog, and I can reply with a blog about a blog about a blog about a blog, and then we can end up in some kind of blog-related feedback loop that wipes us both from existence so that we’d never have written the blogs in the first place. But that would be a paradox. Anyway, I digress.

Before I continue, let me be clear that during my PhD Graham taught me everything I consider worth knowing about the process of science, theory, psychopathology and academic life. So, I take what he says (unless it’s about marriage or statistics) very seriously. The two points I want to focus on are:

2.  Do an experiment but make up or severely massage the data to fit your hypothesis. This is an obvious one, but is something that has surfaced in psychological research a good deal recently (http://bit.ly/QqF3cZ; http://nyti.ms/P4w43q).

Clearly the number of high-profile retractions/sackings in recent times suggests that there is a lot of this about (not just in psychology). However, I think there is a more widespread problem than deliberate manipulation of data. For example, I remember reading somewhere about (I think) the Dirk Smeesters case, or it might have been the Stapel one (see, I’m very scientific and precise); in any case, the person in question (perhaps it was someone entirely different) claimed that they hadn’t thought they were doing anything wrong when applying the particular brand of massage therapy that they had applied to their data. So, although there are high-profile cases of fraud that have been delved into, I think there is a wider problem of people simply doing the wrong thing with their data because they don’t know any better. I remind you of Hoekstra, Kiers and Johnson’s recent study that asked postgraduate students (with at least two years’ research experience) about assumptions in their data; this paper showed (sort of) that recent postgraduate researchers don’t seem to check assumptions. I’d be very surprised if it’s just postgraduates. I would bet that assumptions (what they mean, when they matter and what to do about them) are concepts that are poorly understood by many very experienced researchers, not just within psychology. My suspicions are largely founded on the fact that I have only relatively recently really started to understand why and when these things matter, and I’m a geek who takes an interest in these things. I’d also bet that misunderstandings about assumptions and the robustness of tests stem from being taught by people who poorly understand these things themselves. I’m reminded of Haller and Krauss’s (2002) study showing that statistics teachers misunderstand p-values. The fourth edition of my SPSS book (plug: out early 2013) is the first book in which I really feel that I have handled the teaching of assumptions adequately, so I’m not at all blameless in this mess. (See the two recent blogs on normality and homogeneity also.)

I’d really like to do a study looking at more experienced researchers’ basic understanding of assumptions (a follow-up to Hoekstra’s study on a more experienced sample, and with more probing questions), just to see whether my suspicions are correct. Maybe I should email Hoekstra and see if they’re interested because, left to my own devices, I’ll probably never get around to it.

Anyway, my point is that I think it’s not just deliberate fraud that creates false knowledge, there is also a problem of well-intentioned and honest folk simply not understanding what to do, or when to do it.

3.  Convince yourself that a significant effect at p=.055 is real. How many times have psychologists tested a prediction only to find that the critical comparison just misses the crucial p=.05 value? How many times have psychologists then had another look at the data to see if it might just be possible that with a few outliers removed this predicted effect might be significant? Strangely enough, many published psychology papers are just creeping past the p=.05 value – and many more than would be expected by chance! Just how many false psychology facts has that created? (http://t.co/6qdsJ4Pm).

This is a massive over-simplification because an effect with p = .055 is ‘real’ and might very well be ‘meaningful’. Conversely, an effect with p < .001 might very well be meaningless. To my mind it probably matters very little whether a p is .055 or .049. I’m not suggesting I approve of massaging your data, but really this point illustrates how wedded psychologists are to the idea of effects somehow magically becoming ‘real’ or ‘meaningful’ once p drifts below .05. There are a few points to make here:

First, all effects are ‘real’. No one should be making a decision about whether an effect is real or not real: they’re all real; it’s just that some are large and some are small. The decision to make is whether an effect is meaningful, and that decision should be made within the context of the research question.

Second, I think an equally valid way to create ‘false knowledge’ is to publish studies based on huge samples reporting loads of small and meaningless effects that are highly significant. Imagine you look at the relationship between statistical knowledge and eating curry. You test 1 million people and find a highly significant negative relationship, r = -.002, p < .001. You conclude that the effect of eating curry is ‘real’: it is meaningfully related to poorer statistical knowledge. There are two issues here: (1) in a sample of 1 million people the effect size estimate will be very precise and the confidence interval very narrow, so we know the true effect in the population is going to be very close indeed to -.002. In other words, there is basically no effect in the population: eating curry and statistical knowledge have such a weak relationship that you may as well forget about it. (2) Anyone trying to replicate this effect in a sample substantially smaller than 1 million is highly unlikely to get a significant result. You’ve basically published an effect that is ‘real’ if you use p < .05 to define your reality, but that is utterly meaningless and won’t replicate (in terms of p) in small samples.
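
To see the point in numbers, here is a quick simulation sketch in R (the variable names and the correlation of -.002 are invented to mirror the example): with a million cases the estimate is extremely precise, so the confidence interval pins the population effect very close to zero whether or not p sneaks under .05.

```r
set.seed(123)
n <- 1e6

# Simulate a population correlation of about -.002 between curry eating
# and statistical knowledge (purely illustrative numbers)
curry     <- rnorm(n)
knowledge <- -0.002 * curry + rnorm(n)

result <- cor.test(curry, knowledge)
result$estimate  # r is tiny (around -.002)
result$conf.int  # the 95% CI is only about .004 wide: a very precise 'nothing'
result$p.value   # p may well creep below .05 despite the effect being negligible
```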

Third, there is a wider problem than people massaging their ps: you have to ask why people massage their ps in the first place. The answer is that psychology is so hung up on p-values. More than 10 years after the APA published its report on statistical reporting (Wilkinson, 1999) there has been no change in the practice of applying all-or-nothing thinking and accepting results as ‘real’ if p < .05. It’s true that Wilkinson’s report has had a massive impact on the frequency with which effect sizes and confidence intervals are reported, but (in my experience, which is perhaps not representative) these effect sizes are rarely interpreted with any substance, and it is still the p-value that drives decisions made by reviewers and editors.

This whole problem would go away if the ‘meaning’ and ‘substance’ of effects were treated not as a dichotomous decision but as points along a continuum. You quantify your effect, you construct a confidence interval around it, and you interpret it within the context of the precision that your sample size allows. This way, studies with large samples could no longer focus on meaningless but significant effects; instead, the researcher could say (given the high level of precision they have) that the effects in the population (the true effects, if you like) are likely to be about the size observed, and interpret them accordingly. In small studies, rather than throwing the baby out with the bathwater, large effects could be given some credibility, but with the caveat that the estimates lack precision. This is where replication is useful. There is no need to massage data: researchers just give it to the reader as it is, interpret it, and apply the appropriate caveats. One consequence of this might be that rather than publishing a single small study with massaged data to get p < .05, researchers would be encouraged to replicate their own study a few times and report all of the replications in a more substantial paper. Doing so would mean that across a few studies you could show (regardless of p) the likely size of the effect in the population, as the sketch below illustrates.
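
Purely as an illustration (the correlations and sample sizes below are invented), here is one simple way to summarise a handful of small replications in R: Fisher’s z is used to pool the effect sizes and put a confidence interval around the pooled estimate. A proper meta-analysis package would do this more thoroughly; this is just a sketch of the idea.

```r
# Correlations and sample sizes from three hypothetical small replications
rs <- c(0.42, 0.31, 0.38)
ns <- c(28, 35, 30)

# Fisher z-transform each r; the standard error of z is 1/sqrt(n - 3)
zs <- atanh(rs)
se <- 1 / sqrt(ns - 3)

# Simple fixed-effect pooling: inverse-variance weighted average of the zs
w       <- 1 / se^2
z_pool  <- sum(w * zs) / sum(w)
se_pool <- sqrt(1 / sum(w))

# Back-transform the pooled estimate and its 95% CI to the r metric
pooled_r <- tanh(z_pool)
ci <- tanh(z_pool + c(-1.96, 1.96) * se_pool)
round(c(r = pooled_r, lower = ci[1], upper = ci[2]), 3)
```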

That turned into a bigger rant than I was intending ….

References

Haller, H., & Krauss, S. (2002). Misinterpretations of Significance: A Problem Students Share with Their Teachers? MPR-Online, 7(1), 1-20. 
Wilkinson, L. (1999). Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist, 54(8), 594-604. 

Assumptions Part 2: Homogeneity of Variance/Homoscedasticity

My last blog was about the assumption of normality, and this one continues the theme by looking at homogeneity of variance (or homoscedasticity, to give it its even more tongue-twisting name). Just to remind you, I’m writing about assumptions because this paper showed (sort of) that recent postgraduate researchers don’t seem to check them. Also, as I mentioned before, I get asked about assumptions a lot. Before I get hauled up before a court for self-plagiarism, I will be up front and say that this is an edited extract from the new edition of my Discovering Statistics book. If making edited extracts of my book available for free makes me a bad and nefarious person then so be it.

Assumptions: A reminder

Now, I’m even going to self-plagiarize my last blog to remind you that most of the models we fit to data sets are based on the general linear model (GLM). This fact means that any assumption that applies to the GLM (i.e., regression) applies to virtually everything else. You don’t really need to memorize a list of different assumptions for different tests: if it’s a GLM (e.g., ANOVA, regression, etc.) then you need to think about the assumptions of regression. The most important ones are:

  • Linearity
  • Normality (of residuals) 
  • Homoscedasticity (aka homogeneity of variance) 
  • Independence of errors. 

What Does Homoscedasticity Affect?

Like normality, if you’re thinking about homoscedasticity, then you need to think about three things (made concrete in the short sketch after the list below):

  1. Parameter estimates: That could be an estimate of the mean, or a b in regression (and a b in regression can represent differences between means). If we assume equality of variance then the estimates we get using the method of least squares will be optimal. 
  2. Confidence intervals: whenever you have a parameter, you usually want to compute a confidence interval (CI) because it’ll give you some idea of what the population value of the parameter is. 
  3. Significance tests: we often test parameters against a null value (usually we’re testing whether b is different from 0). For this process to work, we assume that the parameter estimates have a normal distribution. 
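
To make those three things concrete, here is what they look like for a simple regression in R (the data are simulated just for illustration):

```r
set.seed(5)
x <- rnorm(100)
y <- 3 + 0.5 * x + rnorm(100)

model <- lm(y ~ x)

coef(model)     # 1. parameter estimates (the bs)
confint(model)  # 2. confidence intervals around those estimates
summary(model)  # 3. significance tests of each b against 0
```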

When Does The Assumption Matter?

With reference to the three things above, let’s look at the effect of heterogeneity of variance/heteroscedasticity:

  1. Parameter estimates: If variances for the outcome variable differ along the predictor variable then the estimates of the parameters within the model will not be optimal. The method of least squares (known as ordinary least squares, OLS), which we normally use, will produce ‘unbiased’ estimates of the parameters even when homogeneity of variance can’t be assumed, but better estimates can be achieved using different methods: for example, by using weighted least squares (WLS), in which each case is weighted by a function of its variance. Therefore, if all you care about is estimating the parameters of the model in your sample then you don’t need to worry about homogeneity of variance in most cases: the method of least squares will produce unbiased estimates (Hayes & Cai, 2007). However, if you want even better estimates, then use weighted least squares regression to estimate the parameters (see the sketch after this list). 
  2. Confidence intervals: unequal variances/heteroscedasticity creates a bias and inconsistency in the estimate of the standard error associated with the parameter estimates in your model (Hayes & Cai, 2007). As such, your confidence intervals and significance tests for the parameter estimates will be biased, because they are computed using the standard error. Confidence intervals can be ‘extremely inaccurate’ when homogeneity of variance/homoscedasticity cannot be assumed (Wilcox, 2010). 
  3. Significance tests: same as above. 
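
As a rough sketch of those options in R (the data are simulated, and I’m assuming the sandwich and lmtest packages for the heteroscedasticity-consistent standard errors that Hayes and Cai discuss):

```r
library(sandwich)  # heteroscedasticity-consistent (HC) covariance estimators
library(lmtest)    # coeftest() for tests based on those estimators

set.seed(1)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.4 * x)  # error variance grows with x

# Ordinary least squares: b is unbiased, but its standard error is not
ols <- lm(y ~ x)
summary(ols)

# Option 1: weighted least squares, weighting each case by 1/variance.
# (Here we cheat and use the known variance function; in practice the
# weights would have to be estimated.)
wls <- lm(y ~ x, weights = 1 / (0.4 * x)^2)
summary(wls)

# Option 2: keep the OLS estimates but use an HC standard error (HC3)
coeftest(ols, vcov. = vcovHC(ols, type = "HC3"))
```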

Summary

If all you want to do is estimate the parameters of your model then homoscedasticity doesn’t really matter: if you have heteroscedasticity then using weighted least squares to estimate the parameters will give you better estimates, but the estimates from ordinary least squares will be ‘unbiased’ (although not as good as WLS). 
If you’re interested in confidence intervals around the parameter estimates (bs), or in significance tests of the parameter estimates, then homoscedasticity does matter. However, many tests have variants to cope with these situations: for example, the Welch version of the t-test, the Brown-Forsythe and Welch adjustments in ANOVA, and the numerous robust variants described by Wilcox (2010) and explained, for R, in my book (Field, Miles, & Field, 2012).
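
For what it’s worth, base R gives you some of these variants out of the box; the data below are simulated purely for illustration.

```r
# Three groups with increasingly unequal variances
set.seed(7)
dat <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 30)),
  score = c(rnorm(30, 10, 1), rnorm(30, 11, 3), rnorm(30, 12, 6))
)

# Welch's t-test (R's default t-test does not assume equal variances)
t.test(dat$score[dat$group == "a"], dat$score[dat$group == "b"])

# Welch's ANOVA: an F test that does not assume equal group variances
oneway.test(score ~ group, data = dat, var.equal = FALSE)

# Wilcox's robust variants are available through his R functions,
# as described in the R book.
```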

Declaration


This blog is based on excerpts from the forthcoming 4th edition of ‘Discovering Statistics Using SPSS: and sex and drugs and rock ‘n’ roll’.

References

  • Field, A. P., Miles, J. N. V., & Field, Z. C. (2012). Discovering statistics using R: And sex and drugs and rock ‘n’ roll. London: Sage. 
  • Hayes, A. F., & Cai, L. (2007). Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation. Behavior Research Methods, 39(4), 709-722. 
  • Wilcox, R. R. (2010). Fundamentals of modern statistical methods: Substantially improving power and accuracy. New York: Springer.


Assumptions Part 1: Normality

…. I didn’t grow a pair of breasts. If you didn’t read my last blog that comment won’t make sense, but it turns out that people like breasts so I thought I’d mention them again. I haven’t written a lot of blogs, but my frivolous blog about growing breasts as a side effect of some pills was (by quite a large margin) my most viewed blog. It’s also the one that took me the least time to write and that I put the least thought into. I think the causal factor might be the breasts.

This blog isn’t about breasts, it’s about normality. Admittedly the normal distribution looks a bit like a nipple-less breast, but it’s not one: I’m very happy that my wife does not sport two normal distributions upon her lovely chest. I like stats, but not that much …

Assumptions

Anyway, I recently stumbled across this paper. The authors sent a sample of postgrads (with at least 2 years’ research experience) a bunch of data-analysis scenarios and asked them how they would analyze the data. They were interested in whether, and how, these people checked the assumptions of the tests they chose to use. The good news was that they tended to choose the correct test (although given that all of the scenarios basically required a general linear model of some variety, that wasn’t hard). However, not many of them checked assumptions. The conclusion was that people don’t understand assumptions or how to test them.

I get asked about assumptions a lot. I also have to admit to hating the chapter on assumptions in my SPSS and R books. Well, hate is a strong word, but I think it toes a very conservative and traditional line. In my recent update of the SPSS book (out early next year, before you ask) I completely rewrote this chapter; it takes a very different approach to thinking about assumptions.

Most of the models we fit to data sets are based on the general linear model (GLM), which means that any assumption that applies to the GLM (i.e., regression) applies to virtually everything else. You don’t really need to memorize a list of different assumptions for different tests: if it’s a GLM (e.g., ANOVA, regression, etc.) then you need to think about the assumptions of regression. The most important ones are:

  • Linearity
  • Normality (of residuals)
  • Homoscedasticity (aka homogeneity of variance)
  • Independence of errors.

What Does Normality Affect?

For this post I’ll discuss normality. If you’re thinking about normality, then you need to think about three things that rely on it:

  1. Parameter estimates: That could be an estimate of the mean, or a b in regression (and a b in regression can represent differences between means). Models have error (i.e., residuals), and if these residuals are normally distributed in the population then using the method of least squares to estimate the parameters (the bs) will produce better estimates than other methods.
  2. Confidence intervals: whenever you have a parameter, you usually want to compute a confidence interval (CI) because it’ll give you some idea of what the population value of the parameter is. We use values of the standard normal distribution to compute the confidence interval: using values of the standard normal distribution makes sense only if the parameter estimates actually come from one.
  3. Significance tests: we often test parameters against a null value (usually we’re testing whether b is different from 0). For this process to work, we assume that the parameter estimates have a normal distribution. We assume this because the test statistics that we use (such as the t, F and chi-square), have distributions related to the normal. If parameter estimates don’t have a normal distribution then p-values won’t be accurate. 

What Does The Assumption Mean?

People often think that it’s the data that need to be normally distributed, and that’s what many people test. However, that’s not the case. What matters is that the residuals in the population are normal, and that the sampling distribution of the parameters is normal. We don’t have access to the sampling distribution of the parameters or to the population residuals, though, so we have to guess at what might be going on by looking at the sample instead: typically by testing the data or, better, the residuals from the model we’ve fitted (a quick sketch of that follows below).
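
Here is a quick sketch in R of what that looks like in practice (simulated data; and interpret any formal test in light of your sample size, as discussed in the K-S post): fit the model first, then look at its residuals rather than the raw outcome scores.

```r
set.seed(11)
group   <- factor(rep(c("control", "treatment"), each = 40))
outcome <- c(rnorm(40, 10, 2), rnorm(40, 13, 2))

# A t-test is just a GLM with one dummy predictor, so fit it as one
model <- lm(outcome ~ group)

# Inspect the residuals, not the raw scores
res <- residuals(model)
qqnorm(res); qqline(res)  # graphical check
hist(res)                 # a simple histogram
shapiro.test(res)         # formal test; interpret alongside the plots
```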

When Does The Assumption Matter?

The central limit theorem, however, tells us that no matter what distribution things have, the sampling distribution will be normal if the sample is large enough. How large is large enough is another matter entirely and depends a bit on which test statistic you want to use, so bear that in mind. Oversimplifying things a bit, though, we could say:

  1. Confidence intervals: For confidence intervals around a parameter estimate to be accurate, that estimate must come from a normal distribution. The central limit theorem tells us that in large samples the estimate will have come from a normal distribution regardless of what the sample or population data look like. Therefore, if we are interested in computing confidence intervals then we don’t need to worry about the assumption of normality if our sample is large enough (there is still the question of how large is large enough, though). You can easily construct bootstrap confidence intervals these days, so if your interest is confidence intervals then why not stop worrying about normality and use bootstrapping instead (see the sketch after this list)?
  2. Significance tests: For significance tests of models to be accurate, the sampling distribution of what’s being tested must be normal. Again, the central limit theorem tells us that in large samples this will be true no matter what the shape of the population. Therefore, the shape of our data shouldn’t affect significance tests provided our sample is large enough. (How large is large enough depends on the test statistic and the type of non-normality: kurtosis, for example, tends to screw things up quite a bit.) You can make a similar argument for using bootstrapping to get a robust p-value, if p is your thing.
  3. Parameter Estimates: The method of least squares will always give you an estimate of the model parameters that minimizes error, so in that sense you don’t need to assume normality of anything to fit a linear model and estimate the parameters that define it (Gelman & Hill, 2007). However, there are other methods for estimating model parameters, and if you happen to have normally distributed errors then the estimates that you obtained using the method of least squares will have less error than the estimates you would have got using any of these other methods. 
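
Here is a minimal bootstrap sketch in R (assuming the boot package, with simulated non-normal errors): resample cases, refit the model each time, and take percentile or bias-corrected limits for the slope rather than relying on normal theory.

```r
library(boot)

set.seed(3)
dat <- data.frame(x = runif(100))
dat$y <- 1 + 2 * dat$x + rexp(100)  # deliberately skewed (non-normal) errors

# Statistic to bootstrap: the slope from refitting the model to a resample
boot_slope <- function(data, indices) {
  coef(lm(y ~ x, data = data[indices, ]))["x"]
}

boots <- boot(data = dat, statistic = boot_slope, R = 2000)
boot.ci(boots, type = c("perc", "bca"))  # bootstrap CIs for the slope
```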

Summary

If all you want to do is estimate the parameters of your model then normality doesn’t really matter. If you want to construct confidence intervals around those parameters, or compute significance tests relating to those parameters, then the assumption of normality matters in small samples; but because of the central limit theorem we don’t really need to worry about this assumption in larger samples. The question of how large is large enough is a complex issue, but at least you now know which parts of your analysis will go screwy if the normality assumption is broken.

This blog is based on excerpts from the forthcoming 4th edition of ‘Discovering Statistics Using SPSS: and sex and drugs and rock ‘n’ roll’.