This is part of the series Finding Ghosts in Your Data.

For the inaugural post in this series, I want to spend a few moments on terminology. What’s the difference between an “outlier” and an “anomaly”? The interesting thing is that there’s no clear delineation in the literature. As a quick example, Mehrotra et al. define outliers and anomalies as the same thing: “substantial variations from the norm” (4), although they do point out in a footnote that you can think of anomalies as occurring in processes and outliers as occurring in data. By contrast, Aggarwal has a definition of outliers and anomalies that I prefer (1-4). Here is a quick summary of the difference.

Define Your Terms

An outlier is something significantly different from the norm. For example, in the following image, we can see our data follow a consistent distribution everywhere except for something on the edge of the graph. Those data points are outliers.

Data which follows a sort-of-normal distribution, except for the data which doesn’t.

Within the set of outliers, we can differentiate two classes: anomalies and noise. Anomalies are outliers which are interesting to us, and outliers which are not interesting to us are noise.

Isn’t That Interesting?

This leads to an important question: how do we determine if something is interesting or not? There’s a lot to this question and I won’t be able to answer it conclusively here—in fact, it’s an answer which is going to take the majority of the book to sort out—but let’s lay out a few pointers and take it from there.

Different Models

Our first point is actually another definition of anomalies: anomalies can be defined as data which appear to be generated from a different model than the rest of the data. Looking at the image above, we can easily imagine that the bulk of our data comes from some process which follows a normal distribution. But that other data is approximately 8 standard deviations from the mean. To give you an idea of how unlikely that is, a normally distributed process has only a 1 in 390,682,215,445 chance of producing a point 7 or more standard deviations from the mean (in either direction), and 8 standard deviations is rarer still. Having multiple data points 7 or 8 standard deviations from the mean is so improbable that, in reality, those data points probably didn’t come from the same process which generated the rest. The fact that they come from a different process is interesting, and that is what makes those outliers anomalous.
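If you want to check that 1-in-390-billion figure yourself, here is a minimal sketch in Python. It assumes a standard normal distribution and counts both tails, which is the reading that matches the number above. The function name `two_sided_tail_probability` is my own invention for illustration; `math.erfc` is the standard library's complementary error function.

```python
import math


def two_sided_tail_probability(k: float) -> float:
    """Return P(|Z| >= k) for a standard normal variable Z.

    Uses the identity P(|Z| >= k) = erfc(k / sqrt(2)).
    """
    return math.erfc(k / math.sqrt(2))


# Print the odds of landing at least k standard deviations from the mean.
for k in (3, 5, 7, 8):
    p = two_sided_tail_probability(k)
    print(f"{k} sigma: p = {p:.3e}, roughly 1 in {1 / p:,.0f}")
```

The odds fall off extremely quickly as `k` grows, which is why a handful of points sitting 7 or 8 standard deviations out is far better explained by a second process than by bad luck.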

Distance from Normality

I also snuck in a second conception of what makes an outlier anomalous: distance from normality. The idea here is that the further away a thing is from the norm, the more likely it is to be an anomaly rather than noise. That’s because our processes are probabilistic. If I say that a golfer can sink this putt from 8 feet, what I mean is that the golfer will hit the ball and it will end up in the hole with some fairly high probability, but not always. The ball might end up a few inches away from the hole with some probability, or it could end up a few feet from the hole with some smaller probability. Therefore, if I pluck one example, drop a golfer in to make an 8′ putt, and the ball ends up 5″ away from the hole, missing the putt is an outlier event, but it’s still within the realm of expectations. From our standpoint of analysis, this is noise. If the ball ends up 50′ away, there’s a totally different story behind it.
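To make the golf example a little more concrete, here is a rough sketch in Python which scores a miss by how many standard deviations it sits from a baseline of typical putts. Everything in it is an assumption for illustration: the baseline distances are made up, the two cutoffs are arbitrary, and `classify_miss` is a hypothetical helper. Miss distances also aren’t really normally distributed, so treat the standard deviation here as a crude yardstick and nothing more.

```python
from statistics import mean, stdev

# Hypothetical miss distances, in inches, for 8-foot putts (0 = holed).
# These values are invented for illustration; they are not measured data.
BASELINE_MISSES = [0, 0, 2, 0, 3, 0, 1, 0, 4, 0, 2, 1]

# Assumed cutoffs, in standard deviations. Real data would need tuning.
OUTLIER_CUTOFF = 2.0
ANOMALY_CUTOFF = 6.0


def classify_miss(distance_inches, baseline=BASELINE_MISSES):
    """Label a miss by its distance from the baseline mean, in std devs."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    z = abs(distance_inches - mu) / sigma
    if z < OUTLIER_CUTOFF:
        return "inlier"   # an ordinary result
    if z < ANOMALY_CUTOFF:
        return "noise"    # an outlier, but within the realm of expectations
    return "anomaly"      # far enough out that a different story is likely


# 1 inch, 5 inches, and 50 feet (600 inches) from the hole.
for distance in (1, 5, 600):
    print(distance, classify_miss(distance))
```

Distance is only one signal of whether an outlier is interesting, so a fixed cutoff like this is a starting point for discussion and not a complete test.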
