This is part seven in a series on applying forensic accounting techniques to SQL and R.
What Have We Learned?
Over the course of this series, we learned what it means to know your data. Oh, sure, the premise of the series was to discover and uncover fraud in a data set, but the point is that these techniques help you learn the distributions and subtle interactions of your data, things which can help you considerably as a data platform professional.
Why This All Matters
You can be a good database administrator or database developer and not really know your data at this level. You can even be a great database administrator and not care that much about the data itself. But knowing how to perform these tests can give you an edge in several ways:
- If you know about time-based skewness in your data, you might be able to put together a partitioning strategy to improve performance. If most data is static after 90 days (or one year, or whatever), you might be able to offload it to a table with a 100% fill factor and more non-clustered indexes, whereas the more frequently churning data would live in tables with fewer indexes, or perhaps in a memory-optimized table.
- Knowing about cohorts lets you figure out which customers are behemoths in terms of resource utilization. You might be able to build special processes for these major customers to avoid parameter sniffing issues. At an extreme, you might learn enough about the data that it could make sense to move a mega-customer off onto its own database or even server.
- Having an idea of cardinality by column lets you trace out which values probably come from lookup lists and which are probably user-entered. It can also help you find dirty data. For example, Salutation is a field which probably should not have many distinct values; if you see a large number of results for salutation, you dig into it and find Mr., Mr, Mister, MR, and so on. This could be an area of code to clean up: store a list of valid salutations and have users select one from a drop-down list. If you have hundreds of these fields all over your code base, looking at them one by one is tedious; using rapply() in R lets you see your areas of focus.
- If you need to generate realistic-looking test data, building data off of a normal distribution is usually a lot better than just running RAND() on things. Complaining about a lack of data in lower environments is all well and good, but here's a technique to generate a lot of realistic data knowing just a couple of data points (mean and standard deviation).
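As a quick sketch of the cardinality idea, here is how you might spot a dirty low-cardinality column in R. The customers data frame and its values are made up for illustration; in practice you would pull the column from your database.

```r
# Hypothetical data frame standing in for a customer table.
customers <- data.frame(
  Salutation = c("Mr.", "Mr", "Mister", "MR", "Mrs.", "Dr.", "Mr."),
  stringsAsFactors = FALSE
)

# Cardinality per column: a lookup-style column such as Salutation
# should have only a handful of distinct values.
sapply(customers, function(col) length(unique(col)))

# A frequency table exposes the dirty variants directly.
table(customers$Salutation)
```

A column whose distinct count is far higher than you would expect from a drop-down list is a good candidate for cleanup.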
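To illustrate the test-data point, a minimal R sketch using rnorm(): given only a mean and a standard deviation, you can generate a plausible distribution of values. The invoice-amount figures here are invented for the example.

```r
set.seed(106)  # fixed seed so the sketch is reproducible

# Suppose we know invoice amounts average $1,850 with a standard
# deviation of $300 (hypothetical figures). rnorm() draws from a
# normal distribution; pmax() keeps any stray draws non-negative.
invoice_amounts <- pmax(round(rnorm(n = 10000, mean = 1850, sd = 300), 2), 0)

mean(invoice_amounts)  # close to 1850
sd(invoice_amounts)    # close to 300
```

Two summary statistics get you data that looks far more like production than uniform random values would.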
This is off the top of my head; I’m sure you can come up with additional benefits as well.
Although I covered a number of tactics in this series, this is not the entirety of forensic accounting techniques. There are others which are interesting, too. For example, you can use market basket analysis to find relationships between items or, just as importantly, places where relationships exist that should not. There are a number of other techniques and subtle modifications on the techniques I shared, but this series should get you a bit closer to knowing your data.