This is part five in a series on applying forensic accounting techniques to SQL and R.
Dealing with Time
This post will be a pretty short one. In my talk, I don’t have any demos, mostly because much of cohort analysis has secretly been time series analysis at the same time. Instead, I’ll lob out a few points and call it a day.
Time series analysis, at its core, is all about how your data changes over time. The grain for time series analysis is important: as we saw in the last post, we were able to get an excellent result at the yearly level when regressing number of active buses versus number of line items.
Going down to the monthly grain, the number of active buses explained none of the monthly variance within a year and so it was a much weaker correlation.
This is one of the difficulties when doing time series work: the specific grain you work against can make or break an analysis.
Use Calendar Tables
This is a strong recommendation for pretty much any SQL Server instance: have a calendar table. I have a two-part series on calendar tables, with part 1 showing how to build a table and part 2 showing sample queries.
Calendar tables will allow you to define your grain to almost any level. If you store data as dates, a calendar table can let you find such weird items as the third Tuesday of each month, Sundays after weeks with holidays on Wednesdays, or the third quarter of fiscal year 2017 (oh, I meant federal fiscal year, not state fiscal year).
Simple Analysis & Seasonality
One of the simplest forms of analysis is count of items per day or week or month. The goal is to look for seasonality: repeated periodic behavior. You might see more sales on one day of the week or one day of the month or one month of the year. You can then use techniques like ARIMA to project forward and estimate future behavior given prior trends.
Additive and Semi-Additive Measures
When performing a time series analysis, it’s important to understand whether your measure is additive, semi-additive, or non-additive with respect to time. Additivity is, simply, whether you can sum results across a dimension and get reasonable results.
For example, an additive measure over time is the sum of deposits in a bank account. If you deposit $10 today, $50 tomorrow, and $100 a week from now, you will have deposited a total of $160. Additive measures give us the most flexibility in how we analyze the data, so we typically prefer to deal with them.
Non-additive measures cannot be aggregated across any dimension, whether time or otherwise. An example of this is a person’s account number. You can’t sum up all of the account numbers of customers at your bank and get something meaningful.
Semi-additive measures fit somewhere in between: they are additive across certain dimensions and non-additive across others. Bank account balance is a great example of a semi-additive measure. Bank account balances are additive across customers: you can sum up all of the balances and determine the total amount of deposits in the bank at a particular point in time. What you cannot do, however, is sum up one customer’s account balances over different points in time and get a meaningful number. If my bank account had $100 in it yesterday and $100 in it today, I don’t have a claim for $200; I have $100 because that is my current balance.
With semi-additive measures, you have a couple of routes for analysis. First, you can look at the state of the measure at fixed points in time. For example, we might look at every account at 9 PM UTC on a daily basis; that helps us see how account balances change and get a consistent view of the bank’s holdings across time.
The other technique is to use feature engineering to convert from balances to debits and credits—that is, the individual transactions which make up the total balance. If you have those, then the values become additive across time and you could use those instead of account balances.
Looking at Date Oddities
In this section, I’ll briefly cover some date oddities which might be worth investigating. First, you can look for “odd” date combinations: do you have order dates after ship dates? Or do you have adjustments after a journal closes? When you have a natural temporal flow, it’s interesting to look for variances in that flow because they’ll tell you a lot about exceptions to processes. For example, there might be a rare circumstance where an order date comes after a ship date because your company might ship out demo products without an order and if a customer buys the unit you fill in the order date and let the customer hang onto the product. These rules are usually hard to discern and are the fuzzy domain of subject matter experts (who themselves may not have thought through the implications of these choices).
The other area where you might find oddities is to focus on date attributes. For example, slice data by the day of the week, by the day of the month, by weekdays versus weekends, by mid-week versus early week versus late week. You might find that invoices never get processed on weekends (because it’s a government agency and nobody works weekends) until one time a bunch of invoices get processed on a Saturday. Simple sums, counts, and averages can get you a long way toward understanding user behavior in these cases.
Don’t Forget About Holidays
Holidays change behavior as well. On some holidays, like Singles Day, sales can go through the roof . On others, like Christmas Day, sales tend to be much lower than you’d expect from just another day of the week. If you’re dealing with worker behavior, their holidays matter too: you won’t see much work done on Christmas Day because they’re all at home.
Also, be aware of floating holidays like Easter and observed holidays like Christmas Day observed. If December 25th is on a Sunday, then companies tend to observe the holiday on December 26th, meaning that even though it’s technically not a holiday, December 26th will still exhibit that kind of holiday behavior in certain circumstances.
It’s easiest to use a calendar table (hint, hint) so you can avoid calculating these kinds of rules on the fly.
This is a tiny fraction of everything you want to do with time series, but should serve as a quick primer on things to think about as you try to understand your data better.