For each chapter in Finding Ghosts in Your Data, I’ll include a few resources that I found interesting. This isn’t a bibliography, strictly speaking, as I might not use all of these in the course of writing, but they were at least worth noting.
Approaching Normality
Chapter 6 looked at an initial slate of tests. In chapter 7, we introduce a new set of tests. The thing is, though, that these tests generally assume that our data follows a normal distribution.
When it comes to figuring out whether your data follows a normal distribution, there are several techniques, and Jason Brownlee’s guide is a really good one. I decided to take three separate tests and use them as the basis for my determination. The first is the Shapiro-Wilk test (shapiro() in SciPy's scipy.stats module), the second is D’Agostino’s K^2 test (normaltest() in scipy.stats), and the third is the Anderson-Darling test (anderson() in scipy.stats). My thinking here is that if all three tests agree that a dataset follows a normal distribution, then I’m pretty safe in calling it that; otherwise, I treat it as not normal.
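Here is a minimal sketch of that three-test agreement check, using the functions from scipy.stats. The helper name check_normality and the default alpha of 0.05 are my own choices for illustration, not necessarily what the book's code uses.

```python
import numpy as np
from scipy import stats


def check_normality(values, alpha=0.05):
    """Run three normality tests and report whether all of them agree.

    Hypothetical helper for illustration; alpha=0.05 is an assumed default.
    """
    x = np.asarray(values, dtype=float)

    # Shapiro-Wilk and D'Agostino's K^2 return p-values. A p-value above
    # alpha means we fail to reject the hypothesis of normality.
    # Note: normaltest needs a reasonable sample size (SciPy warns below 20).
    _, p_shapiro = stats.shapiro(x)
    _, p_dagostino = stats.normaltest(x)

    # Anderson-Darling reports a statistic plus critical values at fixed
    # significance levels (in percent), so pick the level closest to alpha.
    ad = stats.anderson(x, dist="norm")
    levels = np.asarray(ad.significance_level, dtype=float)
    critical = ad.critical_values[np.argmin(np.abs(levels - alpha * 100))]

    results = {
        "shapiro": bool(p_shapiro > alpha),
        "dagostino": bool(p_dagostino > alpha),
        "anderson": bool(ad.statistic < critical),
    }
    # Only call the data normal when every test agrees.
    return all(results.values()), results
```

Calling check_normality(my_values) returns a flag plus the per-test verdicts, which makes it easy to see which test disagreed when the flag comes back False.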
If I need to transform data, that’s where I use a Box-Cox transformation to get to normal-ish data. I say “normal-ish” because there’s no guarantee the result will actually follow a normal distribution. Still, it does a good job on the whole.
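As a rough sketch, SciPy's scipy.stats.boxcox fits the transformation parameter (lambda) for you when you don't supply one. Box-Cox requires strictly positive inputs, so the shift applied below is an assumption of this example rather than something the chapter prescribes, and to_normalish is a hypothetical name.

```python
import numpy as np
from scipy import stats


def to_normalish(values):
    """Box-Cox transform; returns transformed data, fitted lambda, and shift.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(values, dtype=float)

    # Box-Cox is only defined for strictly positive inputs. Shifting the
    # data so its minimum becomes 1 is an assumption made for this sketch.
    shift = 0.0
    if x.min() <= 0:
        shift = 1.0 - x.min()

    # With no lambda supplied, boxcox estimates it by maximum likelihood.
    transformed, fitted_lambda = stats.boxcox(x + shift)
    return transformed, fitted_lambda, shift
```

Because the result is only normal-ish, it is worth running the normality tests again on the transformed values rather than assuming the transformation worked.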
Normal-Friendly Tests
There are three new tests that I added in this chapter. The first is Grubbs’ Test for Outliers. It’s a simple test to implement and works reasonably well at determining if one data point is an outlier. I ended up using the scikit_posthocs library to implement this test.
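To show what the test is doing, here is a hand-rolled sketch of the two-sided Grubbs' test using only NumPy and SciPy's t distribution. This is not the scikit_posthocs implementation; grubbs_outlier_index is a hypothetical name, and alpha=0.05 is an assumed default.

```python
import numpy as np
from scipy import stats


def grubbs_outlier_index(values, alpha=0.05):
    """Return the index of the most extreme point if the two-sided Grubbs'
    test flags it at the given alpha, otherwise None.

    Illustrative sketch, not the scikit_posthocs implementation.
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    if n < 3:
        return None  # the test needs at least three points

    sd = x.std(ddof=1)
    if sd == 0:
        return None  # all values identical, nothing to flag

    # Test statistic: largest absolute deviation from the mean, in units
    # of the sample standard deviation.
    deviations = np.abs(x - x.mean())
    candidate = int(np.argmax(deviations))
    g = deviations[candidate] / sd

    # Critical value derived from the t distribution with n - 2 degrees
    # of freedom.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_critical = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

    return candidate if g > g_critical else None
```

The test only ever considers the single most extreme point, which is why it struggles when two or more outliers mask each other.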
The next test is the Generalized Extreme Studentized Deviate (ESD) test. This is a general form of Grubbs’ test and has the advantage that it can detect more than one outlier. You specify the maximum number of outliers that your dataset may have, and it will pick out up to that number of outliers. There are tradeoffs you have to consider when using it, particularly around degrees of freedom. There’s a nice package for this in R, but because the book uses Python for development, I am once again reliant upon the scikit_posthocs package for its implementation.
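The procedure can be sketched by hand: repeatedly compute the Grubbs-style statistic, remove the most extreme point, and compare each statistic against its own critical value. The code below is an illustration written against NumPy and SciPy's t distribution, not the scikit_posthocs implementation; generalized_esd is a hypothetical name.

```python
import numpy as np
from scipy import stats


def generalized_esd(values, max_outliers, alpha=0.05):
    """Return original indices of points flagged by the generalized ESD test.

    Illustrative sketch, not the scikit_posthocs implementation.
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    # Each removal costs a degree of freedom, so cap the outlier count.
    if max_outliers < 1 or max_outliers > n - 2:
        raise ValueError("max_outliers must be between 1 and n - 2")

    remaining = np.arange(n)  # original indices still in play
    removed = []              # indices in order of removal
    n_outliers = 0

    for i in range(1, max_outliers + 1):
        current = x[remaining]
        sd = current.std(ddof=1)
        if sd == 0:
            break  # remaining values are identical

        # Statistic R_i: most extreme remaining point, in sd units.
        deviations = np.abs(current - current.mean())
        pos = int(np.argmax(deviations))
        r_i = deviations[pos] / sd

        # Critical value lambda_i for this step.
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lambda_i = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))

        removed.append(int(remaining[pos]))
        remaining = np.delete(remaining, pos)

        # The outlier count is the LARGEST i whose statistic exceeds its
        # critical value, so keep going even after a non-significant step.
        if r_i > lambda_i:
            n_outliers = i

    return removed[:n_outliers]
```

Note that the loop does not stop at the first non-significant step. That is what lets the test see past masking, where one large outlier hides another.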
The final test is Dixon’s Q test. Sebastian Raschka has a great explanation of the technique as well as its drawbacks. In the end, I decided to move forward with this because it’s one test among many and I’m not removing data points which fail this test; I’m emphasizing them.
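For a sense of how little machinery the Q test needs, here is a small sketch that checks only the smallest and largest values. The critical values are the commonly tabulated two-sided 95% figures for sample sizes 3 through 10; treat them as an assumption of this example and verify them against a published table before relying on them. dixon_q_flags is a hypothetical name.

```python
import numpy as np

# Commonly tabulated two-sided critical Q values at 95% confidence for
# n = 3..10. Assumed for this sketch; check a published table before use.
Q95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
       7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}


def dixon_q_flags(values, table=Q95):
    """Return the extreme values (lowest and/or highest) that fail the Q test."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = ordered.size
    if n not in table:
        raise ValueError("this sketch only covers sample sizes 3 through 10")

    data_range = ordered[-1] - ordered[0]
    if data_range == 0:
        return []  # all values identical

    # Q is the gap between the suspect point and its nearest neighbor,
    # divided by the full range of the data.
    q_low = (ordered[1] - ordered[0]) / data_range
    q_high = (ordered[-1] - ordered[-2]) / data_range

    flagged = []
    if q_low > table[n]:
        flagged.append(float(ordered[0]))
    if q_high > table[n]:
        flagged.append(float(ordered[-1]))
    return flagged
```

The small table is itself one of the drawbacks: the test is meant for tiny samples, and this sketch refuses anything outside the range it has critical values for.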