Forensic Accounting: Basic Analysis

This is part three in a series on applying forensic accounting techniques to SQL and R.

Getting Exploratory

The term “basic analysis” is pretty broad, so we’re going to cover a few concepts today. Most of this is in the realm of Exploratory Data Analysis, looking at the data to gain a better understanding of what you have before you generate hypotheses. This sounds bland and straightforward today, which is an indication of how well John Tukey’s book Exploratory Data Analysis was received and how insightful he was.

In today’s post, we’ll look at exploring our data along five avenues: summary analysis, growth analysis, gaps in data, duplicates and cardinality, and regression analysis. Most of this work will be in R, using the DBI package to connect to SQL Server.

Summary Analysis

Summary analysis is the easiest of these and generally should be your starting point. Here, you simply want to get an idea of the shape of your data. If you’re in SQL, I tend to start with a SELECT TOP(100) * from the table so I can get an idea of columns, value spread, and data lengths without putting much load on the server. You can also peruse metadata like data types in sys.columns, foreign key relationships, constraints, and even calling code.
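
If you prefer to drive this step from R, you can push the same checks through DBI. Here is a minimal sketch, assuming an open connection named conn and picking dbo.LineItem as the table of interest:

library(DBI)

# Peek at the first hundred rows without dragging the whole table across the wire.
topRows <- dbGetQuery(conn, "SELECT TOP(100) * FROM dbo.LineItem;")

# Column names, data types, lengths, and nullability from sys.columns.
columnInfo <- dbGetQuery(conn, "
    SELECT c.name, t.name AS type_name, c.max_length, c.is_nullable
    FROM sys.columns c
        INNER JOIN sys.types t
            ON c.user_type_id = t.user_type_id
    WHERE c.object_id = OBJECT_ID('dbo.LineItem');")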

Over in R, we have the summary() function, which gives us a statistical summary of each column in a data frame. We can also use head(), which shows us the first few rows of a data frame. Here is an example of running a few summary functions against the bus table.

Getting basic summary information about the bus data set.
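
If you want to reproduce output along those lines at home, a sketch might look like the following, again assuming a DBI connection named conn:

library(DBI)

buses <- dbGetQuery(conn, "SELECT * FROM dbo.Bus;")

# Statistical summary of each column, including NA counts for DateRetired.
summary(buses)

# The first few rows, to get a feel for the shape of the data.
head(buses)

# Active buses are the ones without a retirement date.
sum(is.na(buses$DateRetired))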

What we’ve learned here is that there are 700 buses (which you don’t see directly but can infer from the BusID quartiles). Some of them came into service on 1990-01-01, which is probably a default value because our data set starts in 2011. Buses have retirement dates as well, and a bus without a retirement date is still active. In SQL, those would be rows with a NULL value; in R, those are the rows with NA values. There are 468 active buses in our current data set.

We would repeat this type of analysis for each of the major elements in our data set. Let me take a moment and cover my process for finding those “major elements” because I’m all too liable to gloss over it. Here’s what I tend to do when someone hands me a brand new database:

  1. Find the tables with the most rows in them. In SQL Server Management Studio, you can right-click on a database and navigate to Reports –> Standard Reports –> Disk Usage by Table. This will generate a Reporting Services report showing approximate row counts (based on statistics), data size, and index size. Look at the top few tables there. Some of them are probably logging tables, which I tend to come back to later. If you would rather stay in a query window, the sketch after this list pulls similar row counts from the metadata views.
  2. Pick one of the large tables which appears to have user-relevant information and perform summary analysis, looking at the top few rows to determine what’s in that table. These large tables tend to be fact-style data, storing relevant measures customers care about.
  3. Recursively find foreign key constraints to other tables and look at those tables. These are often explanatory data, providing context to the fact-style data above. If you need help with recursively finding key constraints, I have a script (with bonus ASCII art) and a longer post as well. If your database has no foreign keys, there are still ways around it like looking at joins in the plan cache.
  4. Repeat step 2 until you run out of big tables. Then move on to medium-sized tables you have not already reviewed.
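
As an alternative to the Disk Usage by Table report in step 1, you can query the metadata views directly. Here is a hedged sketch, run through DBI; like the report, these counts come from metadata and are approximate:

library(DBI)

tableSizes <- dbGetQuery(conn, "
    SELECT
        s.name AS SchemaName,
        t.name AS TableName,
        SUM(p.rows) AS ApproximateRowCount
    FROM sys.tables t
        INNER JOIN sys.schemas s
            ON t.schema_id = s.schema_id
        INNER JOIN sys.partitions p
            ON p.object_id = t.object_id
            AND p.index_id IN (0, 1)
    GROUP BY s.name, t.name
    ORDER BY ApproximateRowCount DESC;")

head(tableSizes, 10)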

Creating a database diagram will also be helpful here, especially if foreign key constraints exist.

Now that we’ve summarily covered summary analysis, let’s jump into the next category: growth analysis.

Growth Analysis

Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.

For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, buses in this data set enter and leave service only on that day, so we don’t need to look at every day or perform more complicated analysis.

SELECT
	c.CalendarYear,
	COUNT(*) AS NumberOfBuses
FROM dbo.Bus b
	INNER JOIN dbo.Calendar c
		ON b.DateFirstInService <= c.Date
		AND ISNULL(b.DateRetired, '2018-12-31') >= c.Date
WHERE
	c.CalendarDayOfYear = 1
	AND c.CalendarYear >= 2011
	AND c.CalendarYear < 2019
GROUP BY
	c.CalendarYear
ORDER BY
	c.CalendarYear;

I can show you the SQL results but let’s drop this into R and build a quick and dirty plot.

options(repr.plot.width=6, repr.plot.height=4) 
ggplot(activeBuses, aes(x = CalendarYear, y = NumberOfBuses)) +
    geom_point() +
    geom_line() +
    labs(x = "Calendar Year", y = "Number of Buses", title = "Number of Buses by Year") +
    ylim(0, 500) +
    theme_minimal()

The first line with options is something I do with Jupyter to prevent it from creating huge graphs. From there, we’re creating a scatterplot with a line overlaid, labeling the axes, starting from 0, and using a minimalistic theme. Note that starting from 0 is not required—both line charts and scatter plots can start from points other than 0. I did this to make the steady growth more apparent.

The wheels on the bus go round and round at quite a steady clip.

Next, I want to look at the number of invoices per year. We invoice on a per-bus, per-item basis, so I would expect invoice growth to track bus growth reasonably closely. You can argue about economies of scale (maintenance workers become more efficient, you might get bigger discounts on parts, it makes sense to purchase capital equipment to make the per-bus maintenance fees lower, those types of things) but with the bus count growing so steadily, I’d think that invoices would grow similarly. So let’s take a look.

SELECT
	c.CalendarYear,
	COUNT(*) AS NumberOfInvoices
FROM dbo.LineItem li
	INNER JOIN dbo.Calendar c
		ON li.LineItemDate = c.Date
GROUP BY
	c.CalendarYear
ORDER BY
	c.CalendarYear;

Here is the R code:

ggplot(invoicesPerYear, aes(x = CalendarYear, y = NumberOfInvoices)) +
    geom_point() +
    geom_line() +
    labs(x = "Calendar Year", y = "Number of Invoices", title = "Number of Invoices by Year") +
    theme_minimal()

And the plot:

Hmm, nope, nothing weird at all here.

You can see that invoice growth was fairly steady from 2011 through 2017. Yeah, there are ups and downs but that’s normal in any real data set. The jump in 2018, however, is huge: we’ve effectively doubled the number of invoices despite bus growth being steady. Here’s the plot for expenditures by year, which code I’ve left out for the sake of making you do your own dirty work:

What’re a million dollars between friends?

Those extra invoices added about a million dollars over expectations. This is our first indication that something interesting has happened. Note that this is not evidence of fraud, as there can be a number of innocent explanations: maybe the buses need to go through more maintenance because they’re older, maybe regulatory requirements forced more work on the buses, maybe we got a batch of lemons which need more work done on them. There are plenty of potential causes, but this is well outside the realm of noise.

We’ll shelve this for a little bit and look at our next topic, gap analysis.

Gap Analysis

Gap analysis is something you’d typically run when you care about the lack of a value. For example, accountants tend to get picky about check numbers and invoice numbers being complete. If you go from check 1001 to 1002 to 1004, an accountant wants to know what happened to check 1003. The reason is that if you don’t have a record of 1003, it’s possible that there was embezzlement.

To perform a quick gap analysis on line items, we can use the LEAD() window function, available since SQL Server 2012. Here’s an example of the window function in action:

WITH C AS
(
	SELECT
		li.LineItemID AS CurrentLineItemID,
		LEAD(li.LineItemID) OVER (ORDER BY li.LineItemID) AS NextLineItemID
	FROM dbo.LineItem li
)
SELECT
	CurrentLineItemID + 1 AS rangestart,
	NextLineItemID - 1 AS rangeend
FROM C
WHERE
	NextLineItemID - CurrentLineItemID > 1;

Here’s what we get back:

Ranges of missing values
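
The same check translates naturally to R. Here is a sketch using dplyr's lead() function, assuming you have pulled dbo.LineItem into a data frame named lineItems:

library(dplyr)

lineItems %>%
    arrange(LineItemID) %>%
    mutate(NextLineItemID = lead(LineItemID)) %>%
    filter(NextLineItemID - LineItemID > 1) %>%
    transmute(rangestart = LineItemID + 1, rangeend = NextLineItemID - 1)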

We have several ranges of missing values here, which is a bit concerning, as our invoice numbers should be a complete set. There might be an innocuous reason for this. If we look at sys.columns, we can see that LineItemID is an identity column.

Identity columns are great for auto-incrementing surrogate keys but are less great for auto-incrementing keys with informational context. Let me explain what I mean. If we have line items from 1 to 1000 in the table, the next row we insert will have an ID of 1001 (assuming nobody has changed the seed and our increment value is 1). But what happens if we get an error trying to insert value 1001 and need to roll back the statement? In that case, the value 1001 has been burned and our next insertion attempt will be 1002. This can leave gaps in our data for a totally innocuous reason and without anybody actually knowing.

The same applies to sequence types: it is possible that you fail to insert using a sequence value and might lose that value forever. If you need to track a value like an invoice number, your best bet might be to gin up your own solution. You can create a table which stores your latest used value. Then, when it’s time to use the next value, go into the serializable transaction isolation level and take a lock on the table by beginning a transaction and selecting the value from the table. That will prevent anybody else from using the table and potentially grabbing the same invoice number as you.

In your insertion code, you can then increment the value, insert into the table, and if your operation was successful, update the value in the latest value table. Then close the transaction so other sessions can do their work.
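
Sketched out with DBI, the flow might look something like the following. The dbo.InvoiceNumberTracker table and its LatestInvoiceNumber column are names I made up for illustration, and the UPDLOCK hint is my own addition to keep two sessions from reading the same value at the same moment; treat this as a sketch of the pattern rather than production code.

library(DBI)

# Serializable isolation so the value we read stays locked until we commit.
dbExecute(conn, "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;")
dbBegin(conn)

# Reading the latest value inside the transaction blocks other sessions
# from grabbing the same invoice number.
latest <- dbGetQuery(conn,
    "SELECT LatestInvoiceNumber FROM dbo.InvoiceNumberTracker WITH (UPDLOCK);")
nextInvoiceNumber <- latest$LatestInvoiceNumber[1] + 1

# ...perform the insertion which uses nextInvoiceNumber here...

# Only after a successful insert do we bump the stored value.
dbExecute(conn, paste0(
    "UPDATE dbo.InvoiceNumberTracker SET LatestInvoiceNumber = ",
    nextInvoiceNumber, ";"))

dbCommit(conn)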

This answer works in a low-throughput situation where you don’t expect more than one or two updates every few seconds. Fortunately, most systems which require this level of scrutiny tend to be fairly low-throughput or at least have relatively low concurrency. A process like generating checks for tens of thousands of employees has periods of high throughput but if you batch it all in one transaction on one session, the problem is still tractable.

Duplicates

I’m going to gloss over duplicates here because I’ll get into it in much more detail when we talk about cohort analysis later. For now, here are a few things I’d like to put in your mind.

What is a Duplicate?

There are two different ways we can think about duplicate data. The first way is exact matches on relevant columns where there is no unique key constraint preventing duplication. Suppose we have a LineItemID (which is just a surrogate key) and an InvoiceNumber on our table. That invoice number should be contiguous and unique for each line item. If we don’t have a unique key constraint on that table, however, it becomes possible for someone to use the same invoice number for two lines.
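
Even without a unique key constraint, you can at least detect this after the fact. Here is a quick dplyr sketch, again assuming a lineItems data frame with an InvoiceNumber column:

library(dplyr)

# Invoice numbers which appear on more than one line item.
lineItems %>%
    group_by(InvoiceNumber) %>%
    summarize(NumberOfLines = n()) %>%
    filter(NumberOfLines > 1) %>%
    arrange(desc(NumberOfLines))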

The other side of a duplicate is something which ought to be the same but isn’t, maybe due to a typo. My favorite example of this happens to come from a bank fraud case from a few years back:

When the Federal Reserve Bank of New York cleared five transactions made by the Bangladesh Bank hackers, the money went in two directions. On Thursday, Feb. 4, the Fed’s system sent $20 million to Sri Lanka and $81 million to the Philippines.

The Sri Lankan transaction contained a small but crucial error: The money was being sent to a bank account in the name of a nonprofit foundation, but the electronic message spelled it “fundation.” That prompted Deutsche Bank, an intermediary in the transaction, and a Sri Lankan bank to contact Bangladesh Bank, which led to the payment being cancelled and the money returned.

Here, “foundation” and “fundation” were supposed to be the same but a small typo made a big difference.
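
Typo-level duplicates like this won't show up in a GROUP BY, but string distance functions can flag them. Base R's adist() computes the Levenshtein distance between two strings, so pairs of names which are close but not identical are worth a second look. The vendorNames vector below is an assumed character vector of distinct vendor names:

# One edit (the missing "o") separates these two strings.
adist("foundation", "fundation")

# Flag pairs of vendor names which are suspiciously similar but not identical.
distances <- adist(vendorNames)
which(distances > 0 & distances <= 2, arr.ind = TRUE)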

Duplicates and Fraud

In the Wake County fraud case, one measure of duplication is the number of invoices received on a single day. We can’t have a unique key on date and vendor (or date, vendor, and bus in our case) because it’s completely reasonable for a vendor, on occasion, to send two invoices on the same day. In the Wake County case, however, they had 24 separate days with at least 50 invoices. 50 goes beyond reasonable.
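
A quick way to surface that kind of pattern is to count line items per vendor per day and keep only the implausible ones. Here is a sketch, assuming the lineItems data frame also carries VendorID and LineItemDate columns:

library(dplyr)

lineItems %>%
    group_by(VendorID, LineItemDate) %>%
    summarize(NumberOfInvoices = n()) %>%
    ungroup() %>%
    filter(NumberOfInvoices >= 50) %>%
    arrange(desc(NumberOfInvoices))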

Regression Analysis

I’m not going to be able to give much more than a primer here. Regression analysis is the topic of many a book and course in statistics and getting regression right can be a major time sink. Acknowledging that we will remain superficial here, we can still cover some of the ground. In its most basic form, regression is all about determining if there is a relationship between one or more input variables (also known as independent variables) and our output (the dependent variable).

We saw the line graph of invoices by year and of buses by year. My question is how much the number of buses ends up driving the number of invoices. My expectation is that the number of buses is a key factor in the number of invoices we deal with: as we add new buses to the fleet, I’d expect an approximately linear increase in the amount of maintenance work to perform, as well as the number of parts to purchase. We may see fluctuations but I expect to see a trend.

Regression by Month and Year

The first thing I want to do is regress the number of invoices versus buses using monthly data. My thought here is that the number of buses drives the monthly number of invoices and that the number of invoices grows approximately linearly with the number of buses. Let’s try these out.

First, I have my SQL query that I use to populate a data frame:

WITH buses AS
(
	SELECT
		c.FirstDayOfMonth,
		c.CalendarMonth,
		c.CalendarYear,
		COUNT(*) AS NumberOfBuses
	FROM dbo.Bus b
		INNER JOIN dbo.Calendar c
			ON b.DateFirstInService <= c.Date
			AND ISNULL(b.DateRetired, '2018-12-31') >= c.Date
	WHERE
		c.Date = c.FirstDayOfMonth
		AND c.CalendarYear >= 2011
		AND c.CalendarYear < 2019
	GROUP BY
		c.FirstDayOfMonth,
		c.CalendarMonth,
		c.CalendarYear
),
expenses AS
(
	SELECT
		c.FirstDayOfMonth,
		COUNT(*) AS NumberOfInvoices,
		SUM(li.Amount) AS TotalInvoicedAmount
	FROM dbo.LineItem li
		INNER JOIN dbo.Calendar c
			ON li.LineItemDate = c.Date
	GROUP BY
	c.FirstDayOfMonth
)
SELECT
	b.FirstDayOfMonth,
	b.CalendarMonth,
	b.CalendarYear,
	b.NumberOfBuses,
	e.NumberOfInvoices,
	e.TotalInvoicedAmount
FROM buses b
	INNER JOIN expenses e
		ON b.FirstDayOfMonth = e.FirstDayOfMonth
ORDER BY
	b.FirstDayOfMonth;

Then, I’d like to build a regression. Here is the R code for an Ordinary Least Squares linear regression:

regICPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses,
                 data = filter(expenditures, lubridate::year(FirstDayOfMonth) < 2018))
summary(regICPre2018)

In one function call, I get my linear regression which focuses on tying the number of invoices to the number of buses. I should note that I filter the data to dates before 2018. We saw the big jump in invoices in 2018 and that ruins our results. Because I think something’s odd about that data, I’d like to see what it looks like if we factor out 2018 and look at 2011 through 2017. Here’s what I get back:

Summary results from our first linear regression attempt.

There are a couple of things to pick out of this. First, our R^2 is 0.45, so we are explaining 45% of the variance in NumberOfInvoices. That’s okay but really not that good. In social science contexts, explaining 45% of human behavior is a really good result. But here we’re explaining expenditures and I’d much rather see 85-95% of the variance explained before I think an expenses model is accurate.

One thing we can do to try to improve the regression is to add features.

Adding Features to Regression

We have two additional features at hand: calendar month and calendar year. Let’s try calendar month first:

regICPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses + CalendarMonth,
                 data = filter(expenditures, lubridate::year(FirstDayOfMonth) < 2018))
summary(regICPre2018)
Regression 2: Regress Harder.

The R^2 didn’t move much at all—it went from 45% to 46%. Frankly, that’s noise. At this level, if we’re not seeing a 10% bump (or more) in R^2, I don’t know if I want to include that feature. Notice also that calendar month is not significant according to p-value. We can and should make fun of p values as much as possible, but here’s a case where the results are clear and sensible. Calendar month isn’t a factor in this regression. So let’s remove it and try calendar year.

regICPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses + CalendarYear,
                 data = filter(expenditures, lubridate::year(FirstDayOfMonth) < 2018))
summary(regICPre2018)
Regression 3: With a Vengeance.

Now this result is quite interesting. Our R^2 didn’t change but now neither variable is significant! This is a great example of something called multicollinearity, one of the challenges of regression. Put simply, the number of buses increases by about the same number every year, so there is very high correlation between number of buses and calendar year. Running a correlation test against the two, I end up with a value of 0.978.

In other words, the number of buses and the calendar year move together almost in lockstep; these two variables are collinear. Because they move almost 1 for 1, it is difficult for the regression algorithm to separate the behavior of one from the other. They’re both fighting to explain the same variance and so both end up with higher p-values. Also of interest is that the R^2 doesn’t change. Multicollinearity doesn’t make your overall predictions worse, but it does make it tougher to tell which independent variables are driving the change.
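
If you want to check that correlation yourself, it is a one-liner against the same filtered data frame:

library(dplyr)

with(filter(expenditures, lubridate::year(FirstDayOfMonth) < 2018),
     cor(NumberOfBuses, CalendarYear))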

This is an extreme scenario, mind you, but multicollinearity is a common enough occurrence that you will want to be on the lookout for it. The other linear regression amigos are serial correlation (AKA autocorrelation) and heteroskedasticity (my favorite of the three).

Now let’s take a step back, as we’re not getting the job done with regressing at the month level. Instead of grouping by month, I’ve changed the SQL query to include just calendar year and number of buses / invoices. Let’s see how that looks:

regICAnnualPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses,
                 data = filter(annualExpenditures, CalendarYear < 2018))
summary(regICAnnualPre2018)

I didn’t include the SQL code because it’s a trivial variant on the prior version. Yet I included the trivial variants on the R code because that’s how I roll. Here are my results:

Regression 4: Live Free or Regress Hard.

Wow. We went from explaining less than half of all variance to explaining 97% of the variance. That’s a huge difference and is definitely an interesting result. For a fairly mechanical problem like this one, an R^2 of .97 is high but not “shake your head” high. If this were a social sciences problem and I got an R^2 of .97, I’d wonder what I did wrong.

I don’t like that I have so few data points, but even with the low number of data points, our regression output is indicating that there’s something there. We can also run plot(regICAnnualPre2018) and see that our residuals are both positive and negative and a small percentage of the total values:

The measure of our mismeasurement.

What this tells us is that we do not see the residuals (that is, actual minus estimated) consistently above or below 0, but rather spread out between them. If we saw the residuals consistently over (or under) 0, the residuals would show bias, which can be a problem when performing a regression analysis.

Finally, now that we have a good fit for the pre-2018 data, let’s see what adding 2018 does:

regICAnnual <- lm(formula = NumberOfInvoices ~ NumberOfBuses,
                 data = annualExpenditures)
summary(regICAnnual)
Regression 5: A Good Day to Regress Hard.

That’s a drop from 97% to 71%. It’s a huge drop. If we have no suspicions about data quality, that kind of drop can be devastating to us: it means our model is no longer a great model. But I do harbor some suspicions because 2018’s values are so much larger that I think there’s something weird going on.

One last note, we can take the annual pre-2018 model and generate a prediction to see what our model thinks 2018’s value ought to have been:

predict(regICAnnualPre2018, newdata = filter(annualExpenditures, CalendarYear == 2018))

This returns 5362 versus our actual invoice count of 7700. That’s a difference of more than 2000. Again, this isn’t proof of wrongdoing but it helps us put into perspective the scope of what’s going on. It’s a data point that maybe something weird is going on and this is the scale of that weirdness.

Conclusion

In this post, we looked at a number of analytical techniques to gain insight into our data. We focused mostly on high-level aggregates here, which can help us get a basic understanding of our data. In the next post, we’re going to move to another level of analysis: cohort analysis. This will give us a better idea of just what’s going on with our data.

SQL Saturday Raleigh

SQL Saturday Raleigh 2019 is now live. It will take place on Saturday, April 27th at Wake Technical Community College’s RTP campus.

For the past three events, we hosted at William Peace University, which I absolutely enjoyed having as a venue. But with Wake Tech opening a campus just a couple of miles from the center of the Triangle, we wanted to make the trip more convenient for people living out in Chapel Hill and Durham.

The venue is gorgeous, a brand new building which looks fantastic. You should definitely submit a talk or four and come to the best SQL Saturday in the United States…that day…unless someone else launches one after us…

SQL Server ML Services Fills The Plan Cache

UPDATED 2018-03-13:  SQL Server 2017 CU4 fixes this issue.  See below.

We call SQL Server ML Services a lot.  As in hundreds of thousands of calls a day to build models.  It turns out that doing this has a negative effect:  ML Services plans end up staying in the plan cache and don’t get removed.  Here’s how our plan cache looks:

Plan Cache

A plan cache.  Precipitous drops are precipitated by service restarts.

What happens is that things work fine for a while, until our plan cache hits about 70 GB, after which point we start getting RESOURCE_SEMAPHORE waits on some of our queries and the available space for buffer pool drops to single-digit gigabytes.
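
If you want to watch this happen on your own instance, the plan cache DMVs show the breakdown.  Here is a sketch through DBI (the inner query works just as well on its own in Management Studio); note that this is a generic plan cache query rather than anything ML Services-specific:

library(DBI)

planCache <- dbGetQuery(conn, "
    SELECT
        cp.objtype,
        COUNT(*) AS NumberOfPlans,
        SUM(CAST(cp.size_in_bytes AS BIGINT)) / 1024 / 1024 AS SizeInMB
    FROM sys.dm_exec_cached_plans cp
    GROUP BY cp.objtype
    ORDER BY SizeInMB DESC;")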

This is a problem on SQL Server 2016 and SQL Server 2017.  It’s very unlikely to affect most people, as most people don’t do crazy stuff at this scale.  But hey, what’s the fun in  having a server of my own if I can’t bring it to its knees every once in a while?

Non-Solutions

The first thing that you might do here is try to run something like DBCC FREEPROCCACHE or maybe DBCC FREESYSTEMCACHE('SQL Plans') WITH MARK_IN_USE_FOR_REMOVAL; but neither of those did anything. It appears that R/ML Services plans are not marked for removal and will not clear, no matter how many times you try to flush the cache.

Workaround

For now, the workaround I have is to restart the SQL Server service occasionally. You can see that I have done it twice in the above screenshot. Our application is resilient to short database downtimes, so this isn’t a bad workaround for us; it’s just a little bit of an annoyance.

One thing to keep in mind if you are in this scenario is that if you are running ML Services hundreds of thousands of times a day, your ExtensibilityData folders might have a lot of cruft which may prevent the Launchpad service from starting as expected. I’ve had to delete all folders in \MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData\MSSQLSERVER01 after stopping the SQL Server service and before restarting it. The Launchpad service automatically does this, but if you have a huge number of folders in there, the service can time out trying to delete all of them.  In my experience at least, the other folders didn’t have enough sub-folders inside to make it worth deleting, but that may just be an artifact of how we use ML Services.

Solution

I have worked with Microsoft on the issue and they’re going to release a patch in a future SQL Server 2017 CU to fix this issue.  I’m not sure about SQL Server 2016 and also don’t know exactly when this patch will ship, but it’s working through the pipeline and I’m happy for that.

UPDATE:  2018-03-13

Microsoft has released SQL Server 2017 CU4, which fixes this buffer pool issue.  After the patch, my SQL plan cache has not grown beyond 2 GB after 4 days, whereas prior to the patch, it’d be in the 50-60 GB range by then.

Quick Thoughts On 50 SQL Saturdays

This post is a bit late, as I actually blew past 50 SQL Saturdays earlier in the year, but now that the year is over, I wanted to reflect just a little bit on why I enjoy speaking at SQL Saturdays.

I’ve spoken at 57 of them so far and want to break 75 in 2018.  Here’s the year-by-year breakdown:

  • 2013:  1
  • 2014:  3
  • 2015:  9
  • 2016:  22
  • 2017:  22

I love that the institution of SQL Saturday is popular enough that someone can attend 20+ events a year all around the world.  In my case, the vast majority of my travel is inside the United States, but I get to see parts of the country that I otherwise couldn’t (or wouldn’t think to visit), and I like that.

If you’ve thought about speaking at a SQL Saturday, I highly recommend giving it a shot.  You don’t have to be a great speaker, and definitely don’t need to be a natural speaker in order to present.  That’s exactly the situation I was in back in 2013, when I gave my first SQL Saturday presentation.  That talk had a total of four attendees and it wasn’t a polished talk at all—which is a nice way of saying that even my recollection of the talk was that it wasn’t very good…though at least I still had all 4 attendees at the end of the talk!  But like with everything else, you get better through practice and training.

You also don’t need to criss-cross the country; start with a local conference if there are any, or a nearby regional conference if you can get away with it.  If you speak at one a year, you’re still getting good experience presenting and helping share your knowledge with the community, as well as picking up additional information and potentially making great contacts and friends.

And if you happen to be in the southeast, you should submit for the Raleigh, North Carolina SQL Saturday on April 14th.

Presentation Review, 2017

It’s been another busy year for me presenting. Over the course of 2017, I gave a total of 50 talks at 42 events.  It’s been a lot of fun getting to travel around the world, hitting places as far apart as Vienna, Austria and Sydney, Australia.  I’m hoping to keep up this pace for next year as well.

Now, to look at the goals I had set for the year.  As a quick reminder of my 2017 goals:

  1. Speak at 20 SQL Saturdays and 10 user groups
  2. Speak at 2 paid conferences
  3. Give 6 webinars
  4. Do a full-length, pictures-only talk

I wasn’t quite as successful this year as I was last year.  Let’s see how I did:

Speak at 20 SQL Saturdays and 10 user groups

I ended up speaking at 22 SQL Saturdays this year (and it would have been 23, had I not gotten sick the day before SQL Saturday Pittsburgh), so I beat that part of the goal.  As far as user groups go, I barely eked out speaking at 10 distinct user groups.  So a big green checkmark for this goal.

Speak at 2 paid conferences

Another goal I ended up hitting.  I spoke at NDC Sydney in August, as well as the Mid-Atlantic SQL Server Conference and IT/Dev Connections in October.

Give 6 webinars

Missed it by that much.  I ended up doing 5 webinars this year.  I was pushing for them earlier in the year but slacked off in the middle and that made all the difference.

Do a full-length, pictures-only talk

I haven’t gotten that far yet.  I did end up doing a 10-minute talk which was dominated with pictures, but that’s not quite good enough to count.

On the plus side, I’ve been focusing on more graphics-heavy talks, so at least I’m moving toward a better equilibrium on that front.

Overall Thoughts

I know I made my 2017 goals pretty ambitious, so I’m happy that I was able to do two of them.  Doing more webinars is high on my agenda, particularly now that I have a decent recording setup.  And that will also let me create more videos of my talks.

OWASP Top 10 For 2017

I have been following the OWASP Top 10 for 2017 for a while and have decided to create a talk on the topic.  It was interesting watching the discussion on the OWASP GitHub repo, and now that the list appears to be settled, I can publish my talk.

This talk is meant to provide an overview of the OWASP Top 10 from a .NET developer’s perspective.  To support a 60-minute talk, I’m providing a large number of links to additional resources, as I know I can’t go in depth on any single topic.  This blog post will serve as my Links and Further Information for the talk, so here goes:

General Links

Injection

Authentication and Session Management

Sensitive Data Exposure

XML External Entity Injection

Building A Genetic Algorithm

This is part of a series entitled Genetics In Action.

So far, we have learned a little bit about evolutionary algorithms and taken a look at just enough high school biology to review the basics of genetics.  Today, I want to look at one particular type of evolutionary algorithm called a genetic algorithm.

Although John Holland did not invent the concept of genetic algorithms, he is the man most responsible for popularizing and developing the concept.  Holland’s Adaptation in Natural and Artificial Systems is a classic of the field and ties together the genetic metaphor.  I highly recommend this book if you’re interested in the topic.

Digging Into The Algorithm

In the last post, we looked at how the genetic metaphor ties development concepts to biological concepts, and today we will move beyond that high-level description and cover the specifics of how genetic algorithms work.  We will start by looking at the simplest type of genetic algorithm:  a single chromosome with a fixed number of genes, each of which has two alleles.  In computer science terms, think about a fixed-length array of boolean values.

GA

Two sample chromosomes

In the image above, we can see two sample chromosomes, each of which has eight genes.  We want to build up a population of these chromosomes at random—one of the interesting parts of the genetic algorithm is that it (usually) doesn’t matter where in the total search space we begin, so starting at a bunch of random points is just as good as anything else.

When it comes to choosing the size of the population, there aren’t too many hard-and-fast rules.  I have read recommendations that you should have at least 2 * N, where N is the number of genes that each chromosome has.  If you’re looking at 10-30 genes, I’ve typically had good luck with a population size of somewhere between 100 and 500.  You’ll find out that there is a maximum interesting population size, after which you don’t really get any benefit:  you won’t converge on a solution any faster, and it will take longer to do all of the processing.
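
To make this concrete, here is a small sketch in R of generating a random starting population.  The gene count and population size are arbitrary values in line with the guidance above, not anything prescribed:

numberOfGenes <- 8
populationSize <- 100

# Each row is one chromosome: a fixed-length vector of 0/1 alleles.
population <- matrix(
    sample(c(0L, 1L), populationSize * numberOfGenes, replace = TRUE),
    nrow = populationSize,
    ncol = numberOfGenes)

head(population)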

Once we have our population, the next step is to score each organism in the population.  To score an organism, we apply a fitness function.  In computer science terms, this is usually an actual function, where we use the organism’s chromosomal representation as the inputs and generate and return a score for each chromosome.

Fitness

Scoring each organism

In the image above, we have defined a score for each organism.  This score is typically one of two things:  either it is the distance from an ideal point, or it is a valuation.  In the first case, think of a set of (x, y) coordinates.  We want to define a chromosome that, given an x coordinate, will generate its appropriate y coordinate.  We will calculate some distance between the predicted y and the actual y (for example, we could calculate the Root Mean Square Deviation), where a perfect match has a deviation score of 0.  On the other side, suppose that we can produce N separate products.  Each product has a cost and a price for sale.  Our genetic algorithm might describe the combination of goods we create, and the score would be the net margin (revenue minus cost) of those products.  In that case, a higher number is better for us.
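
In R terms, the fitness function really is just a function applied to each row of the population matrix.  As a toy stand-in (my own assumption, not the scoring behind the images above), score each chromosome by how many of its genes are set to 1:

# Toy fitness function: count the number of 1 alleles in a chromosome.
fitness <- function(chromosome) {
    sum(chromosome)
}

# Score every organism in the population from the earlier sketch.
scores <- apply(population, 1, fitness)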

It’s A Generational Thing

Now that we have the basic idea of a fitness score behind us, let’s go to the next step:  making babies.

Parents

We use a complicated matrix with dozens of measures to find the best fit

Now I am showing the entire population, which has four members.  Each member of the population has its own score, and we will use those scores to help us figure out the next generation of organisms.  The mechanism I am showing in the picture above is the simplest mechanism for genetic algorithms, which is the roulette wheel selection.  Basically, take the fitness values for each member of the population and you get a total score—that’s the 508 above.  Figure out each member’s percentage of the score, and you have a set of values which sums up to 1.  Pick a random number between 0 and 1, and wherever you land on the cumulative distribution function, you have your choice of parent.  Note that you draw with replacement, meaning that you can pull the same organism more than once.

To motivate this example, let’s suppose that red owns numbers from 0 up to .335, blue owns numbers from .335 up to .587, green owns .587 until .815, and yellow owns .815 through 1.0.  Our first random number drawn is .008, so the lucky winner is red.  Then, we draw a second parent and pull .661, which happens to be squarely in green territory.  We now have our two parents.
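
Continuing the sketch from earlier, R's sample() function handles the cumulative distribution bookkeeping for us:  passing the fitness scores in as the prob argument gives us roulette wheel selection, and replace = TRUE means the same organism can be drawn as both parents.

# Roulette wheel selection: draw two parents with probability
# proportional to fitness, with replacement.
parentIndexes <- sample(seq_len(populationSize), size = 2,
                        replace = TRUE, prob = scores)
parents <- population[parentIndexes, ]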

Crossover

Now that we have our two parents, we are going to generate two children.  I need to introduce a new concept:  crossover.

Crossover

When a mommy chromosome loves a daddy chromosome very much…

Crossover is the recombination of a segment of a chromosome.  In the example above, we are switching genes 3-5 from each of the parents for each of the children (though child #2 is shy and is hiding off-camera).

This action is part of the genius behind genetic algorithms.  We’ve already taken some of the fittest organisms (in a large population, we’re going to pull successful organisms much more frequently than unfit organisms, and there are other pulling techniques which bias toward successful chromosomes even more than roulette wheel), and by recombining slices of their genes, we are able to test the waters with new combinations to see if we can find something even more successful.

Of course, there’s no guarantee that the new combination will be more successful than its parents were, so we have a concept known as the crossover percentage.  That is, we only perform crossover a certain percentage of the time.  In practice, this is often anywhere from 60% to 90%.  If we don’t perform crossover, then the two chromosomes just mosey on into the next step of the process.  But if we do roll the dice and land on the ol’ chop-and-swap, then we have two more RNG rounds to play.

The first of these bonus random number pulls determines where we start the chop, and the second pull determines where we stop.  In the picture above, we start at gene 3 and end at gene 5, inclusive.  In genetic algorithms, we typically have fixed-size chromosomes (though that’s not always true!) and therefore symmetrical swaps.
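
Staying with the sketch, here is one way to implement that:  roll against a crossover rate, pick two cut points, and swap the segment between them.  The 0.8 crossover rate is just a sample value from the 60% to 90% range mentioned above.

crossoverRate <- 0.8
children <- parents

if (runif(1) < crossoverRate) {
    # Two random cut points; the swap is inclusive on both ends.
    cutPoints <- sort(sample(seq_len(numberOfGenes), size = 2))
    swapRange <- cutPoints[1]:cutPoints[2]

    # Swap the segment between the cut points to produce two children.
    children[1, swapRange] <- parents[2, swapRange]
    children[2, swapRange] <- parents[1, swapRange]
}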

Turtle Power

The last step in our genetic factory floor is mutation.  One problem with survival-of-the-fittest is that, especially in later generations, we might run low on genetic diversity.  At an extreme, we end up with a population full of exactly the same chromosome, so no matter how you slice them, you get no novel patterns.  If we’ve reached the global maximum, that’s acceptable, but what if we ended up at only a local maximum and can’t jump over to the global max?  That’s where mutation comes into play.

Mutation

Not pictured:  martial artist rat or antagonistic, anthropomorphic pig & rhino combo

Mutation works by modifying a particular gene’s value.  For each gene in each new chromosome, mutate with a probability p.  Usually p is somewhere between 0.001% and 1%, though I’ve read papers that argue in certain circumstances, you might need a mutation rate of 20% to 30%.  Those would be cases with very flat local minima/maxima where you can get trapped in that flat spot and need a kick out.
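
Carrying the sketch one step further, mutation is a pass over each gene in each child, flipping the allele with some small probability:

mutationRate <- 0.01

# Flip each gene independently with probability mutationRate.
mutationMask <- matrix(
    runif(length(children)) < mutationRate,
    nrow = nrow(children))
children[mutationMask] <- 1L - children[mutationMask]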

If you want a fitting metaphor for flat spots, I had an old Toyota Corolla with an old starter.  I’d be able to start the car up successfully several times in a row, but then I’d hit a dead spot in the flywheel and it just wouldn’t start.  Eventually, my dad bought me the Official Starter Repair Kit:  a crowbar.  His advice was to apply sufficient percussive force until I nudged the starter out of its dead spot, and then the car could start successfully.  Mutation provides benefits in much the same way.  And just like my beater bar, mutation is not a technique you want to rely upon constantly.  At the extreme, mutation is just random search, losing all of the important information that a genetic algorithm learns during the process.

Finishing The Process

At this point, we have a finished organism.

NewOrganism

All grown up and ready to take on the world

We do this for each slot in the population and then repeat the process:  score, choose, cross over (or not), mutate.  We have a few potential end conditions.  Some of them are:

  1. Stop after a fixed number of generations
  2. Stop after we reach a known ideal solution (e.g., 0 distance from the actual values)
  3. Stop after we get close enough to a known ideal solution
  4. Stop after we have stasis for a certain number of generations, where we have not improved the fitness score for the best agent in a while
  5. Stop after a certain amount of time

There are other stop conditions as well, but these are the most common.

Conclusion

Today we covered the basics of genetic algorithms and how the process works.  Tomorrow, we’ll look at using a genetic algorithm library to solve different types of problems.