For each chapter in Finding Ghosts in Your Data, I’ll include a few resources that I found interesting. This isn’t a bibliography, strictly speaking, as I might not use all of these in the course of writing, but they were at least worth noting.
Articles
Chapter three of the book is all about the process of formalizing anomaly detection. As humans, we’re pretty dang good at catching outliers, but the reasons why we are good at it don’t really apply to computers. Instead, we need a different way of thinking to understand how best to attack the problem in an automated fashion.
The majority of this chapter deals with statistical distributions. To this end, I highly recommend Jason Brownlee’s gentle introduction to statistical data distributions. Jason covers a lot of ground in a short space, and the article gives you some of the foundational knowledge you’ll need.
From there, Aswath Damodaran provides a review of statistical distributions. This paper tries to help you decide under what circumstances you should use a particular statistical distribution.
While looking at distributions, I introduce ideas like mean, variance, and standard deviation. Those are really good starting points, but relying on them can cause you heartache later on, especially if you’re building a system which adapts to input data. Instead, look at a concept called robust statistics. Frank Hampel explains some of the background behind the concept.
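To make the heartache concrete, here is a minimal sketch in Python comparing the mean and standard deviation against two common robust alternatives, the median and the median absolute deviation (MAD). The sample values and the mad helper are my own illustration rather than anything from the book or from Hampel's work, and the MAD here is left unscaled.

```python
import statistics


def mad(values):
    """Unscaled median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


# Hypothetical sensor readings, then the same readings with one wild value.
clean = [10, 11, 9, 10, 12, 10, 11]
dirty = clean + [100]

for label, data in (("clean", clean), ("dirty", dirty)):
    print(
        label,
        "mean:", round(statistics.mean(data), 3),
        "stdev:", round(statistics.stdev(data), 3),
        "median:", statistics.median(data),
        "MAD:", mad(data),
    )
```

A single bad reading roughly doubles the mean and blows up the standard deviation, while the median and MAD barely move. That matters for a system that adapts to its input: if the outlier itself inflates the spread estimate, later outliers become harder to flag.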
Once you’ve read that, Brian Ripley has a slightly deeper dive into robust statistics. I think Ripley’s intro point that robust statistics hasn’t seen much growth may no longer be valid: the paper was published 15 years ago and there’s been a bit of a resurgence in the topic.
Anthony Atkinson has a set of slides covering robust statistics in detail. This takes you one level deeper into the rabbit hole and really helps you see their value.
Finally, if you want to implement solutions using robust statistical methods, check out Patrick Mair and Rand Wilcox’s paper on using the WRS2 package in R.
At the end of the chapter, I have a brief note on control charts. These charts are extremely useful in a variety of industries, particularly manufacturing. The National Institute of Standards and Technology (NIST) has an outstanding handbook on engineering statistics, and chapter 6 covers process control charts. Even if you only read the intro, you’ll get a good idea of what these are and why they’re useful.
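As a rough illustration of the idea, here is a minimal sketch of a Shewhart-style control chart check in Python. It assumes a baseline of in-control measurements, puts the center line at their mean, and sets the limits three sample standard deviations either side. The function names and data are hypothetical, and real charts often estimate the spread differently (from moving ranges or subgroup ranges, for example), so treat this as a simplification rather than the handbook's procedure.

```python
import statistics


def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) limits from in-control baseline data."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(points, lower, upper):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, x) for i, x in enumerate(points) if x < lower or x > upper]


# Hypothetical measurements from a process believed to be stable.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
lower, center, upper = control_limits(baseline)

# New measurements to check against the limits.
new_points = [10.1, 9.9, 10.9, 10.0, 9.2]
flagged = out_of_control(new_points, lower, upper)
print("limits:", round(lower, 2), round(center, 2), round(upper, 2))
print("flagged:", flagged)
```

The key design point is that the limits come from a baseline period rather than from the points being tested, so a burst of bad measurements cannot widen the limits and hide itself.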