2019 Year In Review, Part 1

This will be a review of my presentation goals for 2019. Tomorrow, I’m going to talk about a bunch of other stuff I did. It’s been a busy year.

Coming into the year, I had 5 major goals and 2 stretch goals. Here’s how I did on each of them.

Speak at 20 SQL Saturdays and 10 User Groups

I ended up hitting 20 SQL Saturdays on the nose. As far as user groups go, I had 13 user group presentations, including two remote. One successful goal down.

Speak in 4 Countries

I ended up speaking at events in six countries: the United States, Canada, Germany, Ireland, the United Kingdom, and Australia. Two successful goals down.

Update 4 Older Talks to Use GitPitch

I’ve become a lot more comfortable with GitPitch (to the point where I’m paying for the full product, as I want it around for a while). At the beginning of the year, I listed out a half-dozen talks which would be good to migrate to GitPitch. I moved four of them, rewrote a fifth (on PolyBase), and will probably retire the 6th at some point. Speaking of which…

Retire 4 Talks

Sometimes, it’s good to put a talk out to pasture. I ended up giving up on a few of them this year, though funny enough, I brought my talk on Biml out of retirement due to internal demand.

I missed on my two stretch goals, but did end up hitting all of my main goals, so I’ve got that going for me. Tomorrow, I get to brag about all of the stuff I was doing while missing those stretch goals…

Forensic Accounting: Basic Analysis

This is part three in a series on applying forensic accounting techniques to SQL and R.

Getting Exploratory

The term “basic analysis” is pretty broad, so we’re going to cover a few concepts today. Most of this is in the realm of Exploratory Data Analysis, looking at the data to gain a better understanding of what you have before you generate hypotheses. This sounds bland and straightforward today, which is an indication of how well Tukey’s book was received and how insightful he was.

In today’s post, we’ll look at exploring our data along five avenues: summary analysis, growth analysis, gaps in data, duplicates and cardinality, and regression analysis. Most of this work will be in R, using the DBI package to connect to SQL Server.

Summary Analysis

Summary analysis is the easiest of these and generally should be your starting point. Here, you simply want to get an idea of the shape of your data. If you’re in SQL, I tend to start with a SELECT TOP(100) * from the table so I can get an idea of columns, value spread, and data lengths without putting much impact on the server. You can also peruse metadata like data types in sys.columns, foreign key relationships, constraints, and even calling code.

Over in R, we have the summary() function, which gives us a statistical summary of each column in a data frame. We can also use head(), which shows me the first few rows in a data frame. Here is an example of running a few summary functions against the bus table.

Getting basic summary information about the bus data set.

What we’ve learned here is that there are 700 buses (which you don’t see directly but can infer from the BusID quartiles). some of them came into service on 1990-01-01, which is probably a default value because our data set starts in 2011. Buses have retirement dates as well, and a bus without a retirement date is still active. In SQL, those would be rows with a NULL value; in R, those are the rows with NA values. There are 468 active buses in our current data set.

We would repeat this type of analysis for each of the major elements in our data set. Let me take a moment and cover my process for finding those “major elements” because I’m all too liable to gloss over it. Here’s what I tend to do when someone hands me a brand new database:

  1. Find the tables with the most rows in them. In SQL Server Management Studio, you can right-click on a database and navigate to Reports –> Standard Reports –> Disk Usage by Table. This will generate a Reporting Services report showing approximate row counts (based on statistics), data size, and index size. Look at the top few tables there. Some of them are probably logging tables, which I tend to come back to later.
  2. Pick one of the large tables which appears to have user-relevant information and perform summary analysis, looking at the top few rows to determine what’s in that table. These large tables tend to be fact-style data, storing relevant measures customers care about.
  3. Recursively find foreign key constraints to other tables and look at those tables. These are often explanatory data, providing context to the fact-style data above. If you need help with recursively finding key constraints, I have a script (with bonus ASCII art) and a longer post as well. If your database has no foreign keys, there are still ways around it like looking at joins in the plan cache.
  4. Repeat step 2 until you run out of big tables. Then move on to medium-sized tables you have not already reviewed.

Creating a database diagram will also be helpful here, especially if foreign key constraints exist.

Now that we’ve summarily covered summary analysis, let’s jump into the next category: growth analysis.

Growth Analysis

Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.

For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, I add and remove buses on that day, so we don’t need to look at every day or perform complicated analyses.

	COUNT(*) AS NumberOfBuses
FROM dbo.Bus b
	INNER JOIN dbo.Calendar c
		ON b.DateFirstInService <= c.Date
		AND ISNULL(b.DateRetired, '2018-12-31') >= c.Date
	c.CalendarDayOfYear = 1
	AND c.CalendarYear >= 2011
	AND c.CalendarYear < 2019

I can show you the SQL results but let’s drop this into R and build a quick and dirty plot.

options(repr.plot.width=6, repr.plot.height=4) 
ggplot(activeBuses, aes(x = CalendarYear, y = NumberOfBuses)) +
    geom_point() +
    geom_line() +
    labs(x = "Calendar Year", y = "Number of Buses", title = "Number of Buses by Year") +
    ylim(0, 500) +

The first line with options is something I do with Jupyter to prevent it from creating huge graphs. From there, we’re creating a scatterplot with a line overlaid, labeling the axes, starting from 0, and using a minimalistic theme. Note that starting from 0 is not required—both line charts and scatter plots can start from points other than 0. I did this to make the steady growth more apparent.

The wheels on the bus go round and round at quite a steady clip.

Next, I want to look at the number of invoices per year. We invoice on a per-bus, per-item basis, so I would expect invoice growth to track bus growth reasonably closely. You can argue about economies of scale (maintenance workers become more efficient, you might get bigger discounts on parts, it makes sense to purchase capital equipment to make the per-bus maintenance fees lower, those types of things) but with the bus count growing so steadily, I’d think that invoices would grow similarly. So let’s take a look.

	COUNT(*) AS NumberOfInvoices
FROM dbo.LineItem li
	INNER JOIN dbo.Calendar c
		ON li.LineItemDate = c.Date

Here is the R code:

ggplot(invoicesPerYear, aes(x = CalendarYear, y = NumberOfInvoices)) +
    geom_point() +
    geom_line() +
    labs(x = "Calendar Year", y = "Number of Invoices", title = "Number of Invoices by Year") +

And the plot:

Hmm, nope, nothing weird at all here.

You can see that invoice growth was fairly steady from 2011 through 2017. Yeah, there are ups and downs but that’s normal in any real data set. The jump in 2018, however, is huge: we’ve effectively doubled the number of invoices despite bus growth being steady. Here’s the plot for expenditures by year, which code I’ve left out for the sake of making you do your own dirty work:

What’re a million dollars between friends?

Those extra invoices added about a million dollars over expectations. This is our first indication that something interesting has happened. Note that this is not evidence of fraud, as there can be a number of innocent explanations: maybe the buses need to go through more maintenance because they’re older, maybe regulatory requirements forced more work on the buses, maybe we got a batch of lemons which need more work done on them. There are plenty of potential causes, but this is well outside the realm of noise.

We’ll shelve this for a little bit and look at our next topic, gap analysis.

Gap Analysis

Gap analysis is something you’d typically run when you care about the lack of a value. For example, accountants tend to get picky about check numbers and invoice numbers being complete. If you go from check 1001 to 1002 to 1004, an accountant wants to know what happened to check 1003. The reason is that if you don’t have a record of 1003, it’s possible that there was embezzlement.

To perform a quick gap analysis on line items, we can use the LEAD() window function, available since SQL Server 2012. Here’s an example of the window function in action:

		li.LineItemID AS CurrentLineItemID,
		LEAD(li.LineItemID) OVER (ORDER BY li.LineItemID) AS NextLineItemID
	FROM dbo.LineItem li
	CurrentLineItemID + 1 AS rangestart,
	NextLineItemID- 1 AS rangeend
	NextLineItemID - CurrentLineItemID > 1;

Here’s what we get back:

Ranges of missing values

We have several ranges of missing values here, which is a bit concerning, as our invoice numbers should be a complete set. There might be an innocuous reason for this. If we look at sys.columns, we can see that LineItemID is an identity column.

Identity columns are great for auto-incrementing surrogate keys but are less great for auto-incrementing keys with informational context. Let me explain what I mean. If we have line items from 1 to 1000 in the table, the next row we insert will have an ID of 1001 (assuming nobody has changed the seed and our increment value is 1). But what happens if we get an error trying to insert value 1001 and need to roll back the statement? In that case, the value 1001 has been burned and our next insertion attempt will be 1002. This can leave gaps in our data for a totally innocuous reason and without anybody actually knowing.

The same applies to sequence types: it is possible that you fail to insert using a sequence value and might lose that value forever. If you need to track a value like an invoice number, your best bet might be to gin up your own solution. You can create a table which stores your latest used value. Then, when it’s time to use the next value, go into the serializable transaction isolation level and take a lock on the table by beginning a transaction and selecting the value from the table. That will prevent anybody else from using the table and potentially grabbing the same invoice number as you.

In your insertion code, you can then increment the value, insert into the table, and if your operation was successful, update the value in the latest value table. Then close the transaction so other sessions can do their work.

This answer works in a low-throughput situation where you don’t expect more than one or two updates every few seconds. Fortunately, most systems which require this level of scrutiny tend to be fairly low-throughput or at least have relatively low concurrency. A process like generating checks for tens of thousands of employees has periods of high throughput but if you batch it all in one transaction on one session, the problem is still tractable.


I’m going to gloss over duplicates here because I’ll get into it in much more detail when we talk about cohort analysis later. For now, here are a few things I’d like to put in your mind.

What is a Duplicate?

There are two different ways we can think about duplicate data. The first way is exact matches on relevant columns where there is no unique key constraint preventing duplication. Suppose we have a LineItemID (which is just a surrogate key) and an InvoiceNumber on our table. That invoice number should be contiguous and unique for each line item. If we don’t have a unique key constraint on that table, however, it becomes possible for someone to use the same invoice number for two lines.

The other side of a duplicate is something which ought to be the same but isn’t, maybe due to a typo. My favorite example of this happens to come from a bank fraud case from a few years back:

When the Federal Reserve Bank of New York cleared five transactions made by the Bangladesh Bank hackers, the money went in two directions. On Thursday, Feb. 4, the Fed’s system sent $20 million to Sri Lanka and $81 million to the Philippines.

The Sri Lankan transaction contained a small but crucial error: The money was being sent to a bank account in the name of a nonprofit foundation, but the electronic message spelled it “fundation.” That prompted Deutsche Bank, an intermediary in the transaction, and a Sri Lankan bank to contact Bangladesh Bank, which led to the payment being cancelled and the money returned.

Here, “foundation” and “fundation” were supposed to be the same but a small typo made a big difference.

Duplicates and Fraud

In the Wake County fraud case, one measure of duplication is the number of invoices received on a single day. We can’t have a unique key on date and vendor (or date, vendor, and bus in our case) because it’s completely reasonable for a customer, on occasion, to send two invoices on the same day. In the Wake County case, however, they had 24 separate days with at least 50 invoices. 50 goes beyond reasonable.

Regression Analysis

I’m not going to be able to give much more than a primer here. Regression analysis is the topic of many a book and course in statistics and getting regression right can be a major time sink. Acknowledging that we will remain superficial here, we can still cover some of the ground. In its most basic form, regression is all about determining if there is a relationship between one or more input variables (also known as independent variables) and our output (the dependent variable).

We saw the line graph of invoices by year and of buses by year. My question is how much the number of buses ends up driving the number of invoices. My expectation is that the number of buses is a key factor in the number of invoices we deal with: as we add new buses to the fleet, I’d expect an approximately linear increase in the amount of maintenance work to perform, as well as the number of parts to purchase. We may see fluctuations but I expect to see a trend.

Regression by Month and Year

The first thing I want to do is regress the number of invoices versus buses using monthly data. My thought here is that the number of buses drives the monthly number of invoices and that the number of invoices grows approximately linearly with the number of buses. Let’s try these out.

First, I have my SQL query that I use to populate a data frame:

WITH buses AS
		COUNT(*) AS NumberOfBuses
	FROM dbo.Bus b
		INNER JOIN dbo.Calendar c
			ON b.DateFirstInService <= c.Date
			AND ISNULL(b.DateRetired, '2018-12-31') >= c.Date
		c.Date = c.FirstDayOfMonth
		AND c.CalendarYear >= 2011
		AND c.CalendarYear < 2019
expenses AS
		COUNT(*) AS NumberOfInvoices,
		SUM(li.Amount) AS TotalInvoicedAmount
	FROM dbo.LineItem li
		INNER JOIN dbo.Calendar c
			ON li.LineItemDate = c.Date
FROM buses b
	INNER JOIN expenses e
		ON b.FirstDayOfMonth = e.FirstDayOfMonth

Then, I’d like to build a regression. Here is the R code for an Ordinary Least Squares linear regression:

regICPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses,
                 data = filter(expenditures, lubridate::year(FirstDayOfMonth) < 2018))

In one function call, I get my linear regression which focuses on tying the number of invoices to the number of buses. I should note that my data is a filter where the date is earlier than 2018. We saw the big jump in invoices in 2018 and that ruins our results. Because I think something’s odd about that data, I’d like to see what it looks like if we factor out 2018 and look at 2011 through 2017. Here’s what I get back:

Summary results from our first linear regression attempt.

There are a couple of things to pick out of this. First, our R^2 is 0.45, so we are explaining 45% of the variance in NumberOfInvoices. That’s okay but really not that good. In social science contexts, explaining 45% of human behavior is a really good result. But here we’re explaining expenditures and I’d much rather see 85-95% of the variance explained before I think an expenses model is accurate.

One thing we can do to try to improve the regression is to add features.

Adding Features to Regression

We have two additional features at hand: calendar month and calendar year. Let’s try calendar month first:

regICPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses + CalendarMonth,
                 data = filter(expenditures, lubridate::year(FirstDayOfMonth) < 2018))
Regression 2: Regress Harder.

The R^2 didn’t move much at all—it went from 45% to 46%. Frankly, that’s noise. At this level, if we’re not seeing a 10% bump (or more) in R^2, I don’t know if I want to include that feature. Notice also that calendar month is not significant according to p-value. We can and should make fun of p values as much as possible, but here’s a case where the results are clear and sensible. Calendar month isn’t a factor in this regression. So let’s remove it and try calendar year.

regICPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses + CalendarYear,
                 data = filter(expenditures, lubridate::year(FirstDayOfMonth) < 2018))
Regression 3: With a Vengeance.

Now this result is quite interesting. Our R^2 didn’t change but now neither variable is significant! This is a great example of something called multicollinearity, one of the challenges of regression. Put simply, the number of buses increases by about the same number every year, so there is very high correlation between number of buses and calendar year. Running a correlation test against the two, I end up with a value of 0.978.

That is, 97.9% of the variance reflected in buses is also reflected in year. These two variables are co-linear. Because these two variables move almost 1 for 1, it is difficult for the regression algorithm to separate behavior in one versus the other. They’re both fighting to explain the same variance and so both end up with higher p-values. Also of interest is that the R^2 doesn’t change. Multicollinearity doesn’t make your overall predictions worse, but it does make it tougher to tell which independent variables are driving the change.

This is an extreme scenario, mind you, but mutlicollinearity is a common enough occurrence that you will want to be on the lookout for it. The other linear regression amigos are serial correlation (AKA autocorrelation) and heteroskedasticity (my favorite of the three).

Now let’s take a step back, as we’re not getting the job done with regressing at the month level. Instead of grouping by month, I’ve changed the SQL query to include just calendar year and number of buses / invoices. Let’s see how that looks:

regICAnnualPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses,
                 data = filter(annualExpenditures, CalendarYear < 2018))

I didn’t include the SQL code because it’s a trivial variant on the prior version. Yet I included the trivial variants on the R code because that’s how I roll. Here are my results:

Regression 4: Live Free or Regress Hard.

Wow. We went from explaining less than half of all variance to explaining 97% of the variance. That’s a huge difference and is definitely an interesting result. For a fairly mechanical problem like this one, an R^2 of .97 is high but not “shake your head” high. If this were a social sciences problem and I got an R^2 of .97, I’d wonder what I did wrong.

I don’t like that I have so few data points, but even with the low number of data points, our regression output is indicating that there’s something there. We can also run plot(regICAnnualPre2018 and see that our residuals are both positive and negative and a small percentage of the total values:

The measure of our mismeasurement.

What this tells us is that we do not see the residuals (that is, estimated – actual) consistently above or below 0, but rather spread out between them. If we saw the residuals consistently over (or under) 0, the residuals would show bias, which can be a problem when performing a regression analysis.

Finally, now that we have a good fit for the pre-2018 data, let’s see what adding 2018 does:

regICAnnual <- lm(formula = NumberOfInvoices ~ NumberOfBuses,
                 data = annualExpenditures)
Regression 5: A Good Day to Regress Hard.

That’s a drop from 97% to 71%. It’s a huge drop. If we have no suspicions about data quality, that kind of drop can be devastating to us: it means our model is no longer a great model. But I do harbor some suspicions because 2018’s values are so much larger that I think there’s something weird going on.

One last note, we can take the annual pre-2018 model and generate a prediction to see what our model thinks 2018’s value ought to have been:

predict(regICAnnualPre2018, newdata = filter(annualExpenditures, CalendarYear == 2018))

This returns 5362 versus our actual invoice count of 7700. That’s a difference of more than 2000. Again, this isn’t proof of wrongdoing but it helps us put into perspective the scope of what’s going on. It’s a data point that maybe something weird is going on and this is the scale of that weirdness.


In this post, we looked at a number of analytical techniques to gain insight into our data. We focused mostly on high-level aggregates here, which can help us get a basic understanding of our data. In the next post, we’re going to move to another level of analysis: cohort analysis. This will give us a better idea of just what’s going on with our data.

SQL Saturday Raleigh

SQL Saturday Raleigh 2019 is now live. It will take place on Saturday, April 27th at Wake Technical Community College’s RTP campus.

For the past three events, we hosted at William Peace University, which I absolutely enjoyed having as a venue. But with Wake Tech opening a campus just a couple of miles from the center of the Triangle, we wanted to make the trip more convenient for people living out in Chapel Hill and Durham.

The venue is gorgeous, a brand new building which looks fantastic. You should definitely submit a talk or four and come to the best SQL Saturday in the United States…that day…unless someone else launches one after us…

SQL Server ML Services Fills The Plan Cache

UPDATED 2018-03-13:  SQL Server 2017 CU4 fixes this issue.  See below.

We call SQL Server ML Services a lot.  As in building hundreds of thousands of times a day to build models.   It turns out that doing this has a negative effect:  ML Services plans end up staying in the plan cache and don’t get removed.  Here’s how our plan cache looks:

Plan Cache

A plan cache.  Precipitous drops are predicated by service restarts.

What happens is that things work fine for a while, until our plan cache hits about 70 GB, after which point we start getting RESOURCE_SEMAPHORE waits on some of our queries and the available space for buffer pool drops to single-digit gigabytes.

This is a problem on SQL Server 2016 and SQL Server 2017.  It’s very unlikely to affect most people, as most people don’t do crazy stuff at this scale.  But hey, what’s the fun in  having a server of my own if I can’t bring it to its knees every once in a while?


The first thing that you might do here is try to run something like DBCC FREEPROCCACHE or maybe DBCC FREESYSTEMCACHE('SQL Plans') WITH MARK_IN_USE_FOR_REMOVAL; but neither of those did anything. It appears that R/ML Services plans are not marked for removal and will not clear, no matter how many times you try to flush the cache.


For now, the workaround I have is to restart the SQL Server service occasionally. You can see that I have done it twice in the above screenshot. Our application is resilient to short database downtimes, so this isn’t a bad workaround for us; it’s just a little bit of an annoyance.

One thing to keep in mind if you are in this scenario is that if you are running ML Services hundreds of thousands of times a day, your ExtensibilityData folders might have a lot of cruft which may prevent the Launchpad service from starting as expected. I’ve had to delete all folders in \MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData\MSSQLSERVER01 after stopping the SQL Server service and before restarting it. The Launchpad service automatically does this, but if you have a huge number of folders in there, the service can time out trying to delete all of them.  In my experience at least, the other folders didn’t have enough sub-folders inside to make it worth deleting, but that may just be an artifact of how we use ML Services.


I have worked with Microsoft on the issue and they’re going to release a patch in a future SQL Server 2017 CU to fix this issue.  I’m not sure about SQL Server 2016 and also don’t know exactly when this patch will ship, but it’s working through the pipeline and I’m happy for that.

UPDATE:  2018-03-13

Microsoft has released SQL Server 2017 CU4, which fixes this buffer pool issue.  After the patch, my SQL plan cache has not grown beyond 2 GB after 4 days, whereas prior to the patch, it’d be in the 50-60 GB range by then.

Quick Thoughts On 50 SQL Saturdays

This post is a bit late, as I actually blew past 50 SQL Saturdays earlier in the year, but now that the year is over, I wanted to reflect just a little bit on why I enjoy speaking at SQL Saturdays.

I’ve spoken at 57 of them so far and want to break 75 in 2018.  Here’s the year-by-year breakdown:

  • 2013:  1
  • 2014:  3
  • 2015:  9
  • 2016:  22
  • 2017:  22

I love that the institution of SQL Saturday is popular enough that someone can attend 20+ events a year all all around the world.  In my case, the vast majority of my travel is inside the United States, but I get to see parts of the country that I otherwise couldn’t (or wouldn’t think to visit), and I like that.

If you’ve thought about speaking at a SQL Saturday, I highly recommend giving it a shot.  You don’t have to be a great speaker, and definitely don’t need to be a natural speaker in order to present.  That’s exactly the situation I was in back in 2013, when I gave my first SQL Saturday presentation.  That talk had a total of four attendees and it wasn’t a polished talk at all—which is a nice way of saying that even my recollection of the talk was that it wasn’t very good…though at least I still had all 4 attendees at the end of the talk!  But like with everything else, you get better through practice and training.

You also don’t need to criss-cross the country; start with a local conference if there are any, or a nearby regional conference if you can get away with it.  If you speak at one a year, you’re still getting good experience presenting and helping share your knowledge with the community, as well as picking up additional information and potentially making great contacts and friends.

And if you happen to be in the southeast, you should submit for the Raleigh, North Carolina SQL Saturday on April 14th.

Presentation Review, 2017

It’s been another busy year for me presenting. Over the course of 2017, I gave a total of 50 talks at 42 events.  It’s been a lot of fun getting to travel around the world, hitting places as far apart as Vienna, Austria and Sydney, Australia.  I’m hoping to keep up this pace for next year as well.

Now, to look at the goals I had set for the year.  As a quick reminder of my 2017 goals:

  1. Speak at 20 SQL Saturdays and 10 user groups
  2. Speak at 2 paid conferences
  3. Give 6 webinars
  4. Do a full-length, pictures-only talk

I wasn’t quite as successful this year as I was last year.  Let’s see how I did:

Speak at 20 SQL Saturdays and 10 user groups

I ended up speaking at 22 SQL Saturdays this year (and would have been 23, had I not gotten sick the day before SQL Saturday Pittsburgh), so I beat that part of the goal.  As far as user groups go, I barely eeked out speaking at 10 distinct user groups.  So a big green checkmark for this goal.

Speak at 2 paid conferences

Another goal I ended up hitting.  I spoke at NDC Sydney in August, as well as the Mid-Atlantic SQL Server Conference and IT/Dev Connections in October.

Give 6 webinars

Missed it by that much.  I ended up doing 5 webinars this year.  I was pushing for them earlier in the year but slacked off in the middle and that made all the difference.

Do a full-length, pictures-only talk

I haven’t gotten that far yet.  I did end up doing a 10-minute talk which was dominated with pictures, but that’s not quite good enough to count.

On the plus side, I’ve been focusing on more graphics-heavy talks, so at least I’m moving toward a better equilibrium on that front.

Overall Thoughts

I know I made my 2017 goals pretty ambitious, so I’m happy that I was able to do two of them.  Doing more webinars is high on my agenda, particularly now that I have a decent recording setup.  And that will also let me create more videos of my talks.

OWASP Top 10 For 2017

I have been following the OWASP Top 10 for 2017 for a while and have decided to create a talk on the topic.  It was interesting watching the discussion on the OWASP GitHub repo, and now that the list appears to be settled, I can publish my talk.

This talk is meant to provide an overview of the OWASP Top 10 from a .NET developer’s perspective.  To support a 60-minute talk, I’m providing a large number of links to additional resources, as I know I can’t go in depth on any single topic.  This blog post will serve as my Links and Further Information for the talk, so here goes:

General Links


Authentication and Session Management

Sensitive Data Exposure

XML External Entity Injection