OWASP Top 10 For 2017

I have been following the OWASP Top 10 for 2017 for a while and have decided to create a talk on the topic.  It was interesting watching the discussion on the OWASP GitHub repo, and now that the list appears to be settled, I can publish my talk.

This talk is meant to provide an overview of the OWASP Top 10 from a .NET developer’s perspective.  To support a 60-minute talk, I’m providing a large number of links to additional resources, as I know I can’t go in depth on any single topic.  This blog post will serve as my Links and Further Information for the talk, so here goes:

General Links

Injection

Authentication and Session Management

Sensitive Data Exposure

XML External Entity Injection

T-SQL Tuesday 95: Polybase And HDInsight

Today is T-SQL Tuesday #95, hosted by Derik Hammer, and the topic is Big Data.

This is a continuation of my Polybase series.  If you haven’t seen this already, I have a twenty-some part series on Polybase in SQL Server 2016 and now SQL Server 2017.  Polybase hasn’t changed with the RTM release of SQL Server 2017, so I haven’t needed to write much new about the topic lately.  That’s my excuse for being lazy, at least.

Additional Resources

If you’re new here, you might be interested in the rest of my Polybase series, which I have in rough reading order.  I also turned this series into a presentation.

You may also like my Kafka series, which led to a presentation on the topic.  Kafka is a fantastic distributed message broker which lives in the Hadoop ecosystem, and I can’t say enough good things about it.

Maybe you’re curious about this Hadoop thing but don’t know where to start.  Go check out my presentation introducing Hadoop, which I have recorded as well.

After that, if you want to see several methods you could use to connect Hadoop to SQL Server (including Polybase), check out my presentation on the topic, which the folks at NDC Sydney helpfully recorded.

Finally, if you just want to get an idea of what the modern data platform space looks like, I have a presentation on that as well.  Carlos L Chacon interviewed me on the SQL Data Partners podcast in Episode 47 and we talked about this in depth, including where Hadoop fits into this ever-growing combination of technologies.

Now, On With The Show

In today’s post, I want to look at trying to connect a SQL Server instance to Azure HDInsight and see if I can take advantage of predicate pushdown.  The short answer is, not really.  The long answer is, read on.

Let’s Spin Up HDInsight

The first step here is to start up an HDInsight cluster.  We won’t need a powerful cluster for this demo.

First, in the Azure Portal, search for HDInsight and you’ll get the option to create a new cluster.  Here are my cluster configuration options:

[Screenshot: cluster configuration options]

I just wanted a classic Hadoop cluster and didn’t need to install HBase, Kafka, or anything else.  Note that Windows is an option for the operating system, but nobody in their right mind runs Hadoop on Windows.  It’s dead; leave it be.

Next, fill out the basics.  Here’s my setup:

[Screenshot: cluster basics]

You can configure your storage options after that.  I’m choosing the default of Azure Storage, though you can also read from Azure Data Lake Storage.  If you want Hive or Oozie metadata to persist after you kill the cluster—something you might want in production, given that you’d shut down the HDInsight cluster when not in use—you can choose an Azure SQL Database in the dropdowns at the bottom.

[Screenshot: storage settings]

Once you’ve got storage set up, you want to edit the cluster size.  By default, you get two D12 head nodes and four D4 worker nodes.  This will run you $3.63 an hour in East US.  That’s about $87 a day, or roughly $2,600 per month.

[Screenshot: editing the cluster size]

You probably want to trim that down a bit if you’re just playing around.  Here’s what I ended up with:

[Screenshot: smaller node sizes]

Using just two D3 worker nodes cuts the price in half.  It’d be a terrible demo for Hadoop and completely unfair in a performance test, but our purpose here is to get a simple Polybase demo working, not to provide a comprehensive study of Hadoop performance.

Once you’ve got everything set up, expect to stare at this for a while:

[Screenshot: deploying HDInsight]

Configuring For Predicate Pushdown

Next up, we want to configure Polybase to use our Hadoop installation.  To find out which version of Hadoop we’re running, I need to SSH to the NameNode:

[Screenshot: SSHing to the NameNode]

My HDInsight cluster is using Hortonworks Data Platform version 2.6.2.2-5:

[Screenshot: HDP version]

That means that in the yarn-site.xml file in my SQL Server instance’s Polybase configuration folder, all path references will be to /usr/hdp/2.6.2.2-5.
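
For reference, here’s roughly what the SQL Server side of that configuration looks like.  This is a sketch, assuming a Hortonworks cluster; connectivity option 7 covers Hortonworks 2.x on Linux, along with Azure Blob Storage.  The version number matters because the yarn.application.classpath property in that yarn-site.xml file needs to point at those /usr/hdp/2.6.2.2-5 paths.

-- A sketch of the Polybase connectivity setting; option 7 maps to
-- Hortonworks 2.x on Linux (plus Azure Blob Storage).  Changing this
-- value requires restarting SQL Server and the Polybase services.
EXEC sp_configure 'hadoop connectivity', 7;
GO
RECONFIGURE;
GO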

The Problem

But now we run into a problem:  there are certain ports which need to be open for Polybase to work.  This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs.  This goes back to the issue we see with spinning up data nodes in Docker:  the ports are not externally available.  If you’ve put your HDInsight cluster into an Azure VNet and monkeyed around with the ports, you might be able to open everything necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.

One Workaround

There’s one consistent workaround that I see people use:  have Azure Blob Storage act as an intermediary.  In this scenario, you’d use Polybase to write to Azure Blob Storage and then pull that data into the Azure Storage account local to the HDInsight cluster.  This works for data transfer, but it doesn’t allow for predicate pushdown.
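
If you’re curious, here’s a hedged sketch of what pointing Polybase at blob storage looks like.  All of the names below are hypothetical, and this assumes the database already has a master key.

-- Hypothetical names throughout; assumes a database master key already exists.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'BlobUser',  -- the identity name is arbitrary for blob storage
SECRET = '<storage account access key>';
GO

-- The wasbs:// location points Polybase at a container in an Azure Storage account.
CREATE EXTERNAL DATA SOURCE AzureBlobIntermediary
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
GO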

Using A Deployment Template

Robert Alexander has created a deployment template to spin up a SQL Server instance and an HDInsight cluster with a sample data set, so if you have an Azure subscription and want to play along, it’s pretty easy.  Just hit the “Deploy” button, fill in a few details, and be prepared to wait half an hour.  Note that the average daily cost is just over $150 for this demo, so you probably want to delete the objects as soon as you can to save money.  You will be charged for an HDInsight cluster even if you’ve shut off all of the machines.  There is no free pause option; the only way not to pay for the service is to delete the cluster.

Unfortunately, I was not able to get this template working.  I got through step one of the deployment but got stuck with a status of Action Required:

[Screenshot: deployment stuck with a status of Action Required]

When I clicked on the deployment link, it’d sit there spinning, waiting for something to load.  In my browser console, I saw the problem:

[Screenshot: API error in the browser console]

Trying to go to that link gives an error message:

{"message":"You don't have access to this deployment ({deploymentID})."}

In this case, {deploymentID} is the exposed GUID after /deployments/ in the link above.  I tried this with a couple of accounts, but was unable to spin up this demo with either of them.

Conclusion

I’m not particularly fond of HDInsight at this point.  There’s a lot of effort involved in getting started, and frankly, if I’m going to go through all of that effort, I’ll just build my own Cloudera or Hortonworks cluster and get full control.  I’ve also been unable to use Polybase on my on-prem SQL Server instance against an HDInsight node.  There are cases where I can see HDInsight being useful, but I don’t think it saves enough time and administrative effort over building your own Hadoop cluster to make it a worthwhile choice.  Maybe in a few years they’ll have it nailed down, but for now, I’d stick to custom-built Hadoop, either on-prem or as VMs in a cloud provider like Azure or AWS.

New Video: Much Ado About Hadoop

I’ve decided to start recording my presentations to ensure that I have them in video form.  They’re much closer to one-take videos than polished, professional videos, but you’ve got to start somewhere.

Here is my first video, Much Ado About Hadoop.

The main audience for this video is people who are interested in learning a bit about the Hadoop ecosystem from a high level but not sure where to begin.  I cover this from a historical perspective, explaining what the major components of Hadoop are, why they made sense in the context of contemporary hardware and software paradigms, and how the Hadoop ecosystem has evolved in response to changes in hardware and software.  This talk is approximately 90 minutes in length and is an expanded version of the talk I’ve given at SQL Saturdays.

From here, as I get a bit more familiar with my software and equipment—and as I get more time—I will post additional videos.  I already know what my second video will be and want to get it out in the next two or three weeks, depending upon circumstances.  From there, who knows?

NDC Sydney Talks Up

NDC has released videos for NDC Sydney.  The NDC Conferences channel has the full set, and there were some fantastic talks this year.

I gave two talks at the conference:

They were surprisingly not awful!  Probably due to all of the post-production…

My favorite talk from NDC Sydney was Laura Bell talking about security for startups.  Definitely watch that one too; she gives four easy tips for us to use to improve our companies’ security postures.

Compress Those Models

I’m currently working on a project to generate a model in SQL Server R Services.  There are some nice demos on how to store a model (like this one that Microsoft has involving ski resort room rental), but there’s one thing that most of these leave out:  scale.

When you serialize a model to the database, it’s going to take up some amount of space.  In our case, we’re using a few different types of models.  Neural nets are relatively svelte, at roughly 40K per model.  On the other end, decision trees can take 2-4 MB per model.  When you’re storing a few models in SQL Server, size isn’t generally much of an issue.  Even if you’re using temporal tables to store model history, it’s probably not a big problem.  But when you’re storing a lot of models, it becomes a big problem.
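
To make the numbers below concrete, here’s a simplified sketch of the sort of table I have in mind.  The names are hypothetical and the real tables have more columns, but the key piece is a VARBINARY(MAX) column holding the serialized model.

-- A simplified, hypothetical version of the model storage table.
CREATE TABLE dbo.ProductModel
(
    ProductID INT NOT NULL,
    ModelType VARCHAR(30) NOT NULL,     -- e.g., neural net or decision tree
    Model VARBINARY(MAX) NOT NULL,      -- the serialized model itself
    TrainingDate DATETIME2(0) NOT NULL,
    CONSTRAINT PK_ProductModel PRIMARY KEY CLUSTERED (ProductID)
);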

How much is a lot?  Well, let’s start with this:

[Screenshot: table sizes before compression]

These are some numbers for an early beta.  We have 91,130 products modeled and a model history of 194,830 products.  Each model, on its own, averages out to 177K in our ProductModel table and 212K in ProductModelHistory.  As mentioned, different types of models take up different amounts of space (and I didn’t elaborate upon all the types of models we’re using), so we wouldn’t expect the aggregates to be exactly the same.  I included a contrast:  we’ve generated 16.4 million quantity sold predictions during this time, but the total disk space is under 1 GB, and 2/3 of that is a unique nonclustered index; the remaining 372 MB is columnstore.  I love columnstore…

Compression

So right now, we’re burning roughly 200K per model.  My stated goal is to be able to store several years’ worth of data for 10 million products.  Let’s say that I need 10 million products in ProductModel and 1 billion rows in ProductModelHistory.  That means we’d end up with 1.86 TB of data in the ProductModel table and 186 TB in ProductModelHistory.  This seems…excessive.

As a result, I decided to try using the COMPRESS() function in SQL Server 2016.  The COMPRESS() function simply applies GZip compression.  Yeah, there are compression algorithms which tend to be more compact (e.g., bz2 or 7z), but GZip is relatively CPU-efficient, and I can wrap my SQL statements with COMPRESS() and DECOMPRESS() without having to change any calling code; I just need to update the two stored procedures I use to insert and retrieve product models.
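
Here’s a minimal sketch of the pattern, reusing the hypothetical dbo.ProductModel table from above:

-- On the way in, compress the serialized model.
DECLARE @ProductID INT = 1;
DECLARE @Model VARBINARY(MAX) = 0x0102;  -- stand-in for a real serialized model

INSERT INTO dbo.ProductModel (ProductID, ModelType, Model, TrainingDate)
VALUES (@ProductID, 'DecisionTree', COMPRESS(@Model), SYSUTCDATETIME());

-- On the way out, decompress.  DECOMPRESS returns VARBINARY(MAX), so the
-- calling code sees exactly what it would have seen without compression.
SELECT pm.ProductID, DECOMPRESS(pm.Model) AS Model
FROM dbo.ProductModel pm
WHERE pm.ProductID = @ProductID;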

With compression in place, my data on disk looks much less unreasonable:

[Screenshot: table sizes after compression]

That’s a drop from 200K per model down to about 45K per model, or an overall 4.7x compression ratio.  I noticed that the type of model was instrumental in the amount of compression.  Our neural nets compressed roughly 2x, going from approximately 40K per model down to 20K per model.  Meanwhile, those decision trees went from 2-4MB down to about 200K per model.

At 45K per model, storing 10 million models is about 429 GB, and 1 billion rows is just shy of 43 TB.  For my needs, that saves approximately 140 TB of disk space.  Your mileage will probably vary, but if you’re generating a large number of models, think about compression.

Feasel’s Law

I’ve had this idea on my mind for a while and figured I might as well write it down…

Feasel’s Law – Any sufficiently advanced data retrieval process will eventually have a SQL interface.

This is a bit tongue-in-cheek, but it plays out pretty well; typically, a major sign of a database product’s maturity is its reimplementation (at least in part) of SQL.  Examples:

  • Hadoop started out with developers writing MapReduce jobs in Java.  Yahoo then created Pig to handle ETL processes.  After that, Facebook created Hive, an implementation of SQL for Hadoop.
  • Spark started out using Java and Scala, and then Python.  Not too long after, we had Shark and then Spark SQL.
  • HBase is a NoSQL database on top of Hadoop.  Now we have Phoenix, which is a SQL interface to HBase.
  • Cassandra’s primary method of operation is CQL, a SQL-like query language.
  • Riak TS has a SQL interface.  Incidentally, I love the author’s comment that SQL isn’t dead yet…
  • The easiest way to access Cosmos DB data?  SQL.  That interface isn’t fully baked yet—it doesn’t allow grouping, for example—but it’s getting there.

One of the few strong counter-examples is MongoDB, which doesn’t have a SQL interface but does have a SQL translation guide.  DynamoDB also does not offer a SQL interface, though there are third-party interfaces and management tools which give you the same effect.

Otherwise, if you’re using a database that was created after 1990, has any significant user base, and is a mature platform, chances are it’s got a native SQL interface.

Marketplaces In Computing, Part 2

Last time around, I looked at the tragedy of the commons as it relates to server resources.  Today, I want to introduce two patterns for optimal resource utilization given different sets of actors with independent goals.

The Microservice Solution

Concept

The first use pattern is the easier one:  remove the contention by having independent teams own their own hardware.  Each team gets its own servers, its own databases, and its own disk and network capacity.  Teams develop on their own servers and communicate via prescribed APIs.  Teams have budgets to purchase hardware and have a few options:  they can rent servers from a cloud provider like AWS or Azure, they can rent hardware from an internal operations team, or they can purchase and operate their own hardware.  Most likely, organizations won’t like option #3, so teams are probably looking at some mixture of options 1 and 2.

Benefits

If a team depends upon another team’s work—for example, suppose that we have a core service like inventory lookup, and other teams call the inventory lookup service—then the team owning the service has incentives to prevent system over-subscription.  That team will be much more likely to throttle API calls, improve call efficiency by optimizing queries or introducing better caching, or find some other way to make best use of their limited resources.

This situation removes the commons by isolating teams onto their own resources and making the teams responsible for these resources.

Downsides

There are a couple of downsides to this arrangement.  First, doing your work through APIs can be a major latency and resource drag.  In a not-so-extreme case, think about two tables within a common service boundary.  The ProductInventory table tells us the quantity of each product we sell.  The ProductAdvertisement table tells us which products we are currently advertising.  Suppose we want to do a simple join of Product to ProductAdvertisement to ProductInventory and get the products where we have fewer than 5 in stock and are paying at least $1 per click on our digital marketing campaign.
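
With direct access to the tables, that’s a single query.  The schema below is hypothetical, but it gets the idea across:

-- Hypothetical schema; the point is that this is one query which the
-- database engine can optimize as a whole.
SELECT
    p.ProductID,
    p.ProductName,
    pi.QuantityOnHand,
    pa.CostPerClick
FROM dbo.Product p
    INNER JOIN dbo.ProductInventory pi
        ON p.ProductID = pi.ProductID
    INNER JOIN dbo.ProductAdvertisement pa
        ON p.ProductID = pa.ProductID
WHERE
    pi.QuantityOnHand < 5
    AND pa.CostPerClick >= 1.00;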

In a standard SQL query running against a modern relational database, the query optimizer uses statistics to figure out which table to start from:  ProductAdvertisement or ProductInventory.  If we have relatively few products with fewer than 5 in stock, we drive from ProductInventory.  If we have relatively few advertisements with a $1 CPC, we drive from ProductAdvertisement.  If neither of those is selective on its own but the inner join of the two is tiny, the optimizer might start with ProductInventory and join to ProductAdvertisement first, taking the relatively small number of results and joining to Product to flesh out those product details.  Regardless of which is the case, the important thing here is that the database engine has the opportunity to optimize this query because it has direct access to all three tables.

If we put up an API guarding ProductInventory, we can now run into an issue where our advertising team needs to drive from the ProductAdvertisement table.  Suppose that there are tens of thousands of products with at least $1 CPC but only a few with fewer than 5 in stock.  We don’t know which products have fewer than 5 in stock, and unless the inventory API has a “get all products with fewer than X in stock” method, we’re going to need to get the full list of products with ads at $1+ CPC and work our way through the inventory API until we find those products with fewer than 5 in stock.  So instead of a single, reasonably efficient database query, we could end up with upwards of tens of thousands of database lookups.  Each lookup on its own is efficient, but the sum total is quite inefficient in comparison, both in terms of the time spent pushing data for tens of thousands of products across the network and then filtering in code (or another database), and in terms of total resource usage.

The other problem, aside from resource latency, is that the throttling team does not always have the best signals available to determine what to throttle and why.  If our inventory team sees the advertising team slamming the inventory API, the inventory team may institute code which starts returning 429 Too Many Requests response codes, forcing the advertising team to slow down their calls.  This fixes the over-subscription problem, but it might be a chainsaw solution to a scalpel problem.  In other words, suppose the advertising team has two different operations:  a low-value operation which makes a large number of requests, and a high-value operation which makes relatively few.  The inventory team doesn’t necessarily know which operation is which, so without coordination between the teams, the inventory team might accidentally block high-value operations while letting low-value operations through.  Or the teams may cooperate and block low-value operations, but do so much blocking that they starve the low-value operation instead of simply throttling it back.  Neither of those answers is great.

The Price Signal

Instead of having different teams own their own hardware and try to live in silos, my preferred solution is to institute price signals.  What follows is an ideal (some might say overkill) implementation.

In this setup, Operations owns the servers.  Like the farmer in my previous example, ops teams want their property to remain as useful as possible.  Operations wants to make their servers enticing and prevent over-subscription.  To do this, they price resource utilization.  On a server with a very low load, teams can use the server for pennies per hour; when the server is at 100% CPU utilization, the price might spike to $20 or $30 per hour to use that server.  There are three important factors here:

  1. All teams have real-time (or very close to real-time) knowledge of the spot price of each server.
  2. Operations may set the price as they see fit.  That might lead to out-of-equilibrium prices, but there’s a factor that counteracts that quite well:
  3. The prices are actual dollar prices.

Operations is no longer a cost center within the company; it now has the opportunity to charge other teams for resource utilization.  If Operations does a good job keeping servers running efficiently and prices their products well, they end up earning the money to expand and improve; if they struggle, they shape up or ship out.  That’s because teams have alternatives.

Viable Alternatives

Suppose the Operations team can’t manage a server to save their lives.  Product teams are free to go back to options #1 or #3:  they can use cloud offerings and set up their own services there, or they can purchase their own hardware and build internal operations teams within the product team.  These real forms of competition force the Operations team to perform at least well enough to keep their customers.  If I’m going to pay more for servers from Operations than I am from AWS, I had better get something in return.  Sure, lock-in value is real and will play a role in keeping me on the Operations servers, but ops needs to provide additional value:  lower-latency connections, the ability to perform queries without going through APIs (one of the big downsides to the microservices issue above), people on staff when things go bump in the night, etc.

These viable alternatives will keep the prices that Operations charges fairly reasonable; if they’re charging $3 per gigabyte of storage per month, I’ll laugh at them and store my data in S3 or Azure Blob Storage for a hundredth of the price.  If they offer 5 cents per gigabyte per month on local flash storage arrays, I’ll play ball.

Taking Pricing Into Account

F.A. Hayek explained the importance of price as a signal in his paper The Use of Knowledge in Society.  Price is the mechanism people use to share disparate, situational, and sometimes contradictory (or at least paradoxical) information that cannot otherwise be aggregated or collected in a reasonable time or fashion.  Tying this work to our situation, we want to explain resource utilization to disparate teams at various points in time.  We can report a bunch of numbers and hope for the best, but if I tell the inventory team that they’re using 48% of a database server’s CPU resources and that the current CPU utilization is 89%, what does that mean?  Does that mean they can increase their load?  That they should decrease their load?  That things are doing just fine?

By contrast, suppose we tell the inventory team that right now, the spot price of this server is $5 per CPU hour, when the normal price is 20 cents per CPU hour.  That is a clear signal that the server is under heavy load and that maybe they should cut back on those giant warehouse-style queries burning up 16 cores.

When teams know that the price has jumped like this, they now have a number of options available:

  1. Prioritize resource utilization.  Are there any low-priority operations going on right now?  Maybe it’d be wise to reschedule those for later, when the server is cheap again.
  2. Refocus efforts elsewhere.  If there’s a server which regularly gets too expensive, maybe it’d be wise to relocate to someplace else, where the price is cheaper.  This can spread the load among servers and make resource utilization more efficient.  As a reminder, that server doesn’t need to be on-prem where Operations owns it; it’s a pretty big world out there with plenty of groups willing to rent some space.
  3. Tune expensive operations.  When dollars and cents come into play, it’s easy to go to a product owner with an ROI.  If my advertising team just got hit with a $5000 bill for a month’s worth of processing on this server and I know I can cut resource usage down to a tenth, I’m saving $4500 per month.  If my next-best alternative does not bring my team at least that much (plus the incremental cost of resource utilization) in revenue, it’s a lot harder for product owners to keep engineering teams from doing tech debt cleanup and resource optimization work.
  4. Burn the money.  Sometimes, a team just has to take it on the chin; all of this work is important and the team needs to get that work done.

Getting There From Here

Okay, so now that I’ve spent some time going over what the end game looks like, how do we possibly get there from here?  I’ll assume that “here” is like most companies I’ve worked at:  there’s a fairly limited understanding of what’s causing server heartache and a limited amount of responsibility that product teams take.

Here are the steps as I see them:

  1. Implement resource tracking.  Start with resource tracking as a whole if you don’t already have it.  Cover per-minute (or per some other time period) measures of CPU load, memory utilization, disk queue length, network bandwidth, and disk utilization.  Once those are in place, start getting resource usage by team.  In SQL Server, that might mean tagging by application name; see the sketch just after this list.
  2. Figure out pricing.  Without solid knowledge of exactly where to begin, there are still two pieces of interesting information:  what other suppliers are charging and break-even cost for staff + servers + whatnot.  Unless you’ve got junk servers and terrible ops staff, you should be able to charge at least a little bit more than AWS/Azure/Google/whatever server prices.  And if your ops team is any good, you can charge a good bit more because you’re doing things like waking up at night when the server falls over.
  3. Figure out budgeting.  This is something that has to come from the top, and it has to make sense.  Your higher-value teams probably will get bigger budgets.  You may not know at the start what the budgets “should” be for teams, but at least you can start with current resource utilization shares.
  4. Figure out the market.  You’ll need an API to show current server price.  Teams can call the API and figure out the current rate.  Ideally, you’re also tracking per-team utilization and pricing like Azure or AWS does to limit surprise.
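
As a sketch of step 1, if each team sets the Application Name field in its connection strings, a rough first pass at per-team resource tracking is as simple as grouping session statistics by program name:

-- A rough first pass at per-team resource usage, assuming each team sets
-- Application Name in its connection strings.  These numbers are cumulative
-- per session, so a real implementation would snapshot them on a schedule.
SELECT
    s.program_name,
    COUNT(*) AS session_count,
    SUM(s.cpu_time) AS total_cpu_time_ms,
    SUM(s.logical_reads) AS total_logical_reads,
    SUM(s.writes) AS total_writes
FROM sys.dm_exec_sessions s
WHERE s.is_user_process = 1
GROUP BY s.program_name
ORDER BY total_cpu_time_ms DESC;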

Once this is in place, it gives teams a way of throttling their own utilization.  There’s still a chance for over-subscription, though, so let’s talk about one more strategy:  auctions.

Auctions

Thus far, we’ve talked about this option as a specific decision that teams make.  But when it comes to automated processes, companies like Google have proven that auctions work best.  In the Google advertisement example, there is a limited resource—the number of advertisement slots on a particular search’s results—and different people compete for those slots.  They compete by setting a maximum cost per click, and Google takes that (plus some other factors) and builds a ranking, giving the highest score the best slot, followed by the next-highest score, etc. until all the slots are filled.

So let’s apply that to our circumstance here.  Instead of simply having teams work through their resource capacity issues—a long-term solution but one which requires human intervention—we could auction off resource utilization.  Suppose the current spot price for a server is 5 cents per CPU hour because there’s little load.  Each team has an auction price for each process—maybe we’re willing to pay $10 per CPU hour for the most critical requests, followed by $1 per hour for our mid-importance requests, followed by 50 cents per hour for our least-important requests.  Other teams have their requests, too, and somebody comes in and floods the server with requests.  As resource utilization spikes, the cost of the server jumps up to 75 cents per CPU hour, and our 50-cent job stops automatically.  It jumps again to $4 per CPU hour and our $1 job shuts off.

That other team is running their stuff for a really long time—long enough that it becomes important to run the mid-importance request.  Our team’s internal API knows this and therefore temporarily raises the bid rate to $5, setting it back down to $1 once we’ve done enough work to satisfy our team’s processing requirements.

Implementing this strategy requires a bit more sophistication, as well as an understanding on the part of the product teams of what happens when the spot price goes above the auction price—that jobs can stop, and it’s up to product teams to spin them down nicely.
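
To make the mechanics concrete, here’s an entirely hypothetical sketch of that check, with the pricing table and job procedure standing in for whatever a real implementation would use:

-- Entirely hypothetical sketch:  a mid-importance job checks the current
-- spot price against its bid and defers itself if the price is too high.
DECLARE @BidPerCpuHour DECIMAL(10, 2) = 1.00;
DECLARE @SpotPerCpuHour DECIMAL(10, 2);

SELECT @SpotPerCpuHour = sp.CurrentPricePerCpuHour
FROM dbo.ServerSpotPrice sp  -- hypothetical table fed by the pricing API
WHERE sp.ServerName = @@SERVERNAME;

IF @SpotPerCpuHour <= @BidPerCpuHour
    EXEC dbo.RunMidImportanceJob;  -- hypothetical procedure
ELSE
    PRINT 'Spot price above bid; deferring this job.';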

Another Spin:  Funbucks

Okay, so most companies don’t like the idea of giving product teams cash and having them transfer real money to an Operations team.  So instead of doing this, you can still have a market system.  It isn’t quite as powerful because there are fewer options available—you might not be able to threaten abandoning the current set of servers for AWS—but it can still work.  Each team still has a budget, but the budget is in an internal scrip.  If you run out of that internal scrip, it’s up to higher-ups to step in.  This makes it a weaker solution, but still workable.

So, You’re Not Serious…Right?

Of course I am, doubting title.  I’m so serious that I’ll even point out cases where what I’m talking about is already in place!

First of all, AWS offers spot pricing on EC2 instances.  These prices tend to be less than the on-demand price and can be a great deal for companies which can afford to run processes at off hours.  You can write code to check the spot price and, if the spot price is low enough, snag an instance and do some work.

As a great example of this, Databricks offers their Community Edition free of charge and uses AWS spot instances to host these.  That keeps prices down for Databricks because they have a hard cap on how high they’re willing to go—I’ve had cases where I’ve tried to spin up a CE cluster and gotten a failure indicating that the spot price was too high and that I should try again later.

Graybeards in the audience will also appreciate this next example:  mainframe time slicing.  This was a common strategy for pricing computer utilization and is very similar to what I’m describing.

Conclusion

We’ve spent the past couple of days looking at how development teams can end up in a tragedy of the commons, as well as different techniques we can use to extricate ourselves from it.  The main purpose of these posts is to show that there are several options available, including creating markets internally.  We haven’t talked about agorics yet, but let’s save that for another day.