Feasel’s Law

I’ve had this idea on my mind for a while and figured I might as well write it down…

Feasel’s Law – Any sufficiently advanced data retrieval process will eventually have a SQL interface.

This is a bit tongue-in-cheek, but it plays out pretty well; typically, a major sign of database maturity is its reimplementation (at least in part) of SQL.  Examples:

  • Hadoop started out with developers writing map-reduce jobs in Java.  Yahoo then created Pig to work with ETL processes.  After that, Facebook created Hive, an implementation of SQL for Hadoop.
  • Spark started out using Java and Scala, and then Python.  Not too long after, we had Shark and then Spark SQL.
  • HBase is a NoSQL database on top of Hadoop.  Now we have Phoenix, which is a SQL interface to HBase.
  • Cassandra’s primary method of operation is CQL, a set of SQL-like statements.
  • Riak TS has a SQL interface.  Incidentally, I love the author’s comment that SQL isn’t dead yet…
  • The easiest way to access CosmosDB data?  SQL.  That interface isn’t fully baked yet—it doesn’t allow grouping, for example—but it’s getting there.

One of the few strong counter-examples is MongoDB, which doesn’t have a SQL interface but does have a SQL translation guide.  DynamoDB also does not offer a SQL interface, though there are third-party interfaces and management tools which give you the same effect.

Otherwise, if you’re using a database that was created after 1990, has any significant user base, and is a mature platform, chances are it’s got a native SQL interface.

Marketplaces In Computing, Part 2

Last time around, I looked at the tragedy of the commons as it relates to server resources.  Today, I want to introduce two patterns for optimal resource utilization given different sets of actors with independent goals.

The Microservice Solution

Concept

The first use pattern is the easier one:  remove the contention by having independent teams own their own hardware.  Each team gets its own servers, its own databases, and its own disk and network capacity.  Teams develop on their own servers and communicate via prescribed APIs.  Teams have budgets to purchase hardware and have a couple of options:  they can rent servers from a cloud provider like AWS or Azure, rent hardware from an internal operations team, or purchase and operate their own hardware.  Most likely, organizations won’t like option #3, so teams are probably looking at some mixture of options 1 and 2.

Benefits

If a team depends upon another team’s work—for example, suppose that we have a core service like inventory lookup, and other teams call the inventory lookup service—then the team owning the service has incentives to prevent system over-subscription.  That team will be much more likely to throttle API calls, improve call efficiency by optimizing queries or introducing better caching, or find some other way to make best use of their limited resources.

This situation removes the commons by isolating teams onto their own resources and making the teams responsible for these resources.

Downsides

There are a couple downsides to this arrangement.  First, doing your work through APIs could be a major latency and resource drag.  In a not-so-extreme case, think about two tables within a common service boundary.  The ProductInventory table tells us the quantity of each product we sell.  The ProductAdvertisement table tells us which products we are currently advertising.  Suppose we want to do a simple join of Product to ProductAdvertisement to ProductInventory and get the products where we have fewer than 5 in stock and are paying at least $1 per click on our digital marketing campaign.

In a standard SQL query running against a modern relational database, the query optimizer uses statistics to figure out which table to start from:  ProductAdvertisement or ProductInventory.  If we have relatively few products with fewer than 5 in stock, we drive from ProductInventory.  If we have relatively few advertisements with at least a $1 CPC, we drive from ProductAdvertisement.  If neither of those is selective on its own but the inner join of the two is tiny, the optimizer might start with ProductInventory and join to ProductAdvertisement first, taking the relatively small number of results and joining to Product to flesh out those product details.  Regardless of which is the case, the important thing here is that the database engine has the opportunity to optimize this query because the database engine has direct access to all three tables.
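
To make the comparison concrete, here is roughly the query we have in mind when all three tables live in the same database.  The column names (ProductName, QuantityOnHand, CostPerClick) are assumptions for the sake of the sketch.

--Products with fewer than 5 in stock and at least $1 CPC on their ads.
--Column names are hypothetical.
SELECT
	p.ProductID,
	p.ProductName,
	pi.QuantityOnHand,
	pa.CostPerClick
FROM dbo.Product p
	INNER JOIN dbo.ProductInventory pi
		ON p.ProductID = pi.ProductID
	INNER JOIN dbo.ProductAdvertisement pa
		ON p.ProductID = pa.ProductID
WHERE
	pi.QuantityOnHand < 5
	AND pa.CostPerClick >= 1.00;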

If we put up an API guarding ProductInventory, we can now run into an issue where our advertising team needs to drive from the ProductAdvertisement table.  Suppose that there are tens of thousands of products with at least $1 CPC but only a few with fewer than 5 in stock.  We don’t know which products have fewer than 5 in stock, and unless the inventory API has a “get all products with fewer than X in stock” method, we’re going to need to get the full list of products with $1+ CPC ads and work our way through the inventory API until we find those products with fewer than 5 in stock.  So instead of a single, reasonably efficient database query, we could end up with upwards of tens of thousands of database lookups.  Each lookup on its own is efficient, but the sum total is rather inefficient in comparison, both in terms of the time spent pushing data for tens of thousands of products across the network and then filtering in code (or another database) and in terms of total resource usage.

The other problem, aside from latency and resource usage, is that the throttling team does not always have the best signals available to determine what to throttle and why.  If our inventory team sees the advertising team slamming the inventory API, the inventory team may institute code which starts returning 429 Too Many Requests response codes, forcing the advertising team to slow down their calls.  This fixes the over-subscription problem, but it might be a chainsaw solution to a scalpel problem.  In other words, suppose the advertising team has two different operations, each generating a number of requests:  one low-value and one high-value.  The inventory team doesn’t necessarily know which operation is which, so without coordination between the teams, the inventory team might accidentally block high-value operations while letting low-value operations through.  Or the teams may coordinate to block only low-value operations, but do so much blocking that they starve the low-value operation instead of simply throttling it back.  Neither of those answers is great.

The Price Signal

Instead of having different teams own their own hardware and try to live in silos, my preferred solution is to institute price signals.  What follows is an ideal (some might say overkill) implementation.

In this setup, Operations owns the servers.  Like the farmer in my previous example, ops teams want their property to remain as useful as possible.  Operations wants to make their servers enticing and prohibit over-subscription.  To do this, they price resource utilization.  On a server with a very low load, teams can use the server for pennies per hour; when the server is at 100% CPU utilization, it might spike to $20 or $30 per hour to use that server.  There are three important factors here:

  1. All teams have real-time (or very close to real-time) knowledge of the spot price of each server.
  2. Operations may set the price as they see fit.  That might lead to out-of-equilibrium prices, but there’s a factor that counteracts that quite well:
  3. The prices are actual dollar prices.

Operations is no longer a cost center within the company; it now has the opportunity to charge other teams for resource utilization.  If Operations does a good job keeping servers running efficiently and prices their products well, they end up earning the money to expand and improve; if they struggle, they shape up or ship out.  That’s because teams have alternatives.

Viable Alternatives

Suppose the Operations team can’t manage a server to save their lives.  Product teams are free to go back to options #1 or #3:  they can use cloud offerings and set up their own services there, or they can purchase their own hardware and build internal operations teams within the product team.  These real forms of competition force the Operations team to perform at least well enough to keep their customers.  If I’m going to pay more for servers from Operations than I am from AWS, I had better get something in return.  Sure, lock-in value is real and will play a role in keeping me on the Operations servers, but ops needs to provide additional value:  lower-latency connections, the ability to perform queries without going through APIs (one of the big downsides to the microservices issue above), people on staff when things go bump in the night, etc.

These viable alternatives will keep the prices that Operations charge fairly reasonable; if they’re charging $3 per gigabyte of storage per month, I’ll laugh at them and store in S3 or Azure Blob Storage for a hundredth of the price.  If they offer 5 cents per gigabyte per month on local flash storage arrays, I’ll play ball.

Taking Pricing Into Account

FA Hayek explained the importance of price as a signal in his paper, The Use of Knowledge in Society.  Price is the mechanism people use to share disparate, situational, and sometimes contradictory or at least paradoxical information that cannot otherwise be aggregated or collected in a reasonable time or in a reasonable fashion.  Tying this work to our situation, we want to explain resource utilization to disparate teams at various points in time.  We can return a bunch of numbers back and hope for the best, but if I tell the inventory team that they’re using 48% of a database server’s CPU resources and that the current CPU utilization is 89%, what does that mean?  Does that mean they can increase their load?  That they should decrease their load?  That things are doing just fine?

By contrast, suppose we tell the inventory team that right now, the spot price of this server is $5 per CPU hour, when the normal price is 20 cents per CPU hour.  That is a clear signal that the server is under heavy load and that maybe they should cut back on those giant warehouse-style queries burning up 16 cores.

When teams know that the price has jumped like this, they now have a number of options available:

  1. Prioritize resource utilization.  Are there any low-priority operations going on right now?  Maybe it’d be wise to reschedule those for later, when the server is cheap again.
  2. Refocus efforts elsewhere.  If there’s a server which regularly gets too expensive, maybe it’d be wise to relocate to someplace else, where the price is cheaper.  This can spread the load among servers and make resource utilization more efficient.  As a reminder, that server doesn’t need to be on-prem where Operations owns it; it’s a pretty big world out there with plenty of groups willing to rent some space.
  3. Tune expensive operations.  When dollars and cents come into play, it’s easy to go to a product owner with an ROI.  If my advertising team just got hit with a $5000 bill for a month’s worth of processing on this server and I know I can cut resource usage down to a tenth, I’m saving $4500 per month.  If my next-best alternative does not bring my team at least that much (plus the incremental cost of resource utilization) in revenue, it’s a lot harder for product owners to keep engineering teams from doing tech debt cleanup and resource optimization work.
  4. Burn the money.  Sometimes, a team just has to take it on the chin; all of this work is important and the team needs to get that work done.

Getting There From Here

Okay, so now that I’ve spent some time going over what the end game looks like, how do we possibly get there from here?  I’ll assume that “here” is like most companies I’ve worked at:  there’s a fairly limited understanding of what’s causing server heartache and a limited amount of responsibility that product teams take.

Here are the steps as I see them:

  1. Implement resource tracking.  Start with resource tracking as a whole if you don’t already have it.  Cover per-minute (or per some other time period) measures of CPU load, memory utilization, disk queue length, network bandwidth, and disk utilization.  Once those are in place, start getting resource usage by team.  In SQL Server, that might mean tagging by application name (see the sketch after this list).
  2. Figure out pricing.  Without solid knowledge of exactly where to begin, there are still two pieces of interesting information:  what other suppliers are charging and break-even cost for staff + servers + whatnot.  Unless you’ve got junk servers and terrible ops staff, you should be able to charge at least a little bit more than AWS/Azure/Google/whatever server prices.  And if your ops team is any good, you can charge a good bit more because you’re doing things like waking up at night when the server falls over.
  3. Figure out budgeting.  This is something that has to come from the top, and it has to make sense.  Your higher-value teams probably will get bigger budgets.  You may not know at the start what the budgets “should” be for teams, but at least you can start with current resource utilization shares.
  4. Figure out the market.  You’ll need an API to show current server price.  Teams can call the API and figure out the current rate.  Ideally, you’re also tracking per-team utilization and pricing like Azure or AWS does to limit surprise.
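
To make step 1 a little more concrete, here is a minimal sketch of what tagging by application name can buy you in SQL Server, assuming each team sets the Application Name keyword in its connection strings (the team name in the comment is hypothetical).  A real implementation would take a snapshot like this on a schedule and log it to a table rather than querying it ad hoc.

--Aggregate current session-level resource usage by application name.
--Hypothetical connection string fragment:  ...;Application Name=InventoryTeam;...
SELECT
	s.program_name,
	COUNT(*) AS session_count,
	SUM(s.cpu_time) AS total_cpu_time_ms,
	SUM(s.logical_reads) AS total_logical_reads,
	SUM(s.reads) AS total_physical_reads,
	SUM(s.writes) AS total_writes
FROM sys.dm_exec_sessions s
WHERE
	s.is_user_process = 1
GROUP BY
	s.program_name
ORDER BY
	total_cpu_time_ms DESC;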

Once this is in place, it gives teams a way of throttling their own utilization.  There’s still a chance for over-subscription, though, so let’s talk about one more strategy:  auctions.

Auctions

Thus far, we’ve talked about this option as a specific decision that teams make.  But when it comes to automated processes, companies like Google have proven that auctions work best.  In the Google advertisement example, there is a limited resource—the number of advertisement slots on a particular search’s results—and different people compete for those slots.  They compete by setting a maximum cost per click, and Google takes that (plus some other factors) and builds a ranking, giving the highest score the best slot, followed by the next-highest score, etc. until all the slots are filled.

So let’s apply that to our circumstance here.  Instead of simply having teams work through their resource capacity issues—a long-term solution but one which requires human intervention—we could auction off resource utilization.  Suppose the current spot price for a server is 5 cents per CPU hour because there’s little load.  Each team has an auction price for each process—maybe we’re willing to pay $10 per CPU hour for the most critical requests, followed by $1 per hour for our mid-importance requests, followed by 50 cents per hour for our least-important requests.  Other teams have their requests, too, and somebody comes in and floods the server with requests.  As resource utilization spikes, the cost of the server jumps up to 75 cents per CPU hour, and our 50-cent job stops automatically.  It jumps again to $4 per CPU hour and our $1 job shuts off.

Suppose that other team keeps running their workload for a really long time, long enough that our mid-importance requests can’t wait any longer.  Our team’s internal API knows this and automatically raises the bid to $5 temporarily, setting it back down to $1 once we’ve done enough work to satisfy our team’s processing requirements.
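
Here is a minimal sketch of the decision each job would make against the spot price.  The table variable, the bid values, and the idea that the spot price comes back from an Operations pricing API are all illustrative assumptions.

--Hypothetical standing bids, one row per job.
DECLARE @JobBid TABLE
(
	JobName VARCHAR(100) NOT NULL,
	MaxBidPerCPUHour DECIMAL(10,2) NOT NULL
);

INSERT INTO @JobBid
(
	JobName,
	MaxBidPerCPUHour
)
VALUES
('CriticalRequests', 10.00),
('MidImportanceRequests', 1.00),
('LowImportanceRequests', 0.50);

--In practice, the current spot price would come from the Operations pricing API.
DECLARE @CurrentSpotPrice DECIMAL(10,2) = 4.00;

--Any job whose bid falls below the current spot price should spin itself down.
SELECT
	jb.JobName,
	jb.MaxBidPerCPUHour,
	CASE
		WHEN jb.MaxBidPerCPUHour >= @CurrentSpotPrice THEN 'Run'
		ELSE 'Pause'
	END AS DesiredState
FROM @JobBid jb;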

Implementing this strategy requires a bit more sophistication, as well as an understanding on the part of the product teams of what happens when the spot price goes above the auction price—that jobs can stop, and it’s up to product teams to spin them down nicely.

Another Spin:  Funbucks

Okay, so most companies don’t like the idea of giving product teams cash and having them transfer real money to an Operations team.  So instead of doing this, you can still have a market system.  It isn’t quite as powerful because there are fewer options available—you might not be able to threaten abandoning the current set of servers for AWS—but it can still work.  Each team still has a budget, but the budget is in an internal scrip.  If you run out of that internal scrip, it’s up to higher-ups to step in.  This makes it a weaker solution, but still workable.

So, You’re Not Serious…Right?

Of course I am, doubting title.  I’m so serious that I’ll even point out cases where what I’m talking about is already in place!

First of all, AWS offers spot pricing on EC2 instances.  These prices tend to be less than the on-demand price and can be a great deal for companies which can afford to run processes at off hours.  You can write code to check the spot price and, if the spot price is low enough, snag an instance and do some work.

As a great example of this, Databricks offers their Community Edition free of charge and uses AWS spot instances to host these.  That keeps prices down for Databricks because they have a hard cap on how high they’re willing to go—I’ve had cases where I’ve tried to spin up a CE cluster and gotten a failure indicating that the spot price was too high and that I should try again later.

The graybeards in the audience will also appreciate this next example:  mainframe time slicing.  This was a common strategy for pricing computer utilization and is very similar to what I’m describing.

Conclusion

We’ve spent the past couple of days looking at how development teams can end up in a tragedy of the commons, and different techniques we can use to extricate ourselves from it.  The main purpose of these posts is to show that there are several options available, including creating markets internally.  We still haven’t talked about agorics yet, but let’s save that for another day.

Marketplaces In Computing, Part 1

I tend to see resource over-subscription problems frequently at work.  We have a set of product teams, and each team has a manager, a product owner, and a set of employees.  These teams share computing resources, though:  they use the same servers, access the same databases, and use the same networks.  This leads to a tragedy of the commons scenario.

Tragedy of the Commons

The tragedy of the commons is a classic concept in economics:  the basic idea is that there exists some common area accessible to all and owned (directly) by none.  Each member of a group is free to make use of this common area.  Let’s get a little more specific and talk about a common grazing field that a group of shepherds shares.  Each shepherd has a total desired utilization of the field:  shepherd A has x sheep which will consume f(x) grass, where 0 <= f(x) <= 1.  Shepherd B has y sheep consuming f(y) grass, and shepherd C has z sheep, consuming f(z) grass.

If f(x) + f(y) + f(z) < 1, then the three shepherds can share the commons.  But let’s say that f(x) = 0.5, f(y) = 0.4, and f(z) = 0.3.  That adds up to 1.2, but we only have enough grass for 1.0.  This means that at least one of the three shepherds will end up losing out on this arrangement—if shepherds A and B get there early, their sheep are going to consume 90% of the available vegetation, leaving shepherd C with 10% instead of his needed 30%.

The tragedy of the commons goes one step further:  because there is overgrazing on this land, eventually the vegetation dies out and the land goes fallow for some time, leaving the shepherds without that place to graze.

There are different spins on the tragedy of the commons, but this is the scenario I want to walk through today.

Solutions to the Tragedy

There are three common solutions to a tragedy of the commons scenario:  private ownership, a governing body, or commons rules.

Private Ownership

This is the easiest solution to understand.  Instead of the three shepherds grazing freely on the commons, suppose that farmer D purchases the land with the intent of subletting it for grazing.  Farmer D will then charge the shepherds some amount to graze on the land, with the goal being to earn as much as he can from use of the land.  Right now, the shepherds are oversubscribed to the land at a cost of 0.  Once the farmer starts charging the shepherds, then we see movement.

As a simple example (without loss of generality), let’s say that the farmer knows that he can’t allow more than 80% of the land to be grazed; if he goes above that mark, we get into overgrazing territory and run the risk of the land going fallow for some time.  Thus, the farmer wants to set prices such that no more than 80% of the land gets grazed and the farmer maximizes his profits.

We won’t set up the equilibrium equation here, but if you really want to go down that route, you’re welcome to.  Let’s say that the equilibrium price is $10 per acre, and at $10 per acre, shepherd B realizes that he can go graze somewhere else for less.  That leaves shepherds A and C, whose total use adds up to 80%.  As an aside, we could just as easily have imagined a scenario where all three shepherds still use the plot but at least one uses less than the f(x, $0) amount would indicate, so that the sum was still somewhere around 80% utilization.

Central Planning

In contrast to the private property solution, central planning involves an individual or committee laying down edicts on use patterns.  In this case, we have a central planner telling each shepherd how much of the land he is allowed to use.  We can work from the assumption that the central planner also knows that overgrazing happens at more than 80% utilization, so the central planner won’t allow shepherds to graze beyond this.  How specifically the central planner allocates the resource is beyond the scope of discussion here, as are important issues like public choice theory.

Commons Rules

The third method of commons resource planning comes from Elinor Ostrom’s work (with this being a pretty good summary in a short blog post).  The gist of it is that local groups tend to formulate rules for behavior on the commons.  This is group decision making in an evolutionary context, in contrast to spontaneous order (the finding of equilibrium using market prices) or enforced order (the central planner making choices).

All three of these mechanisms work in different contexts at different levels.  Historically, much of the literature has argued in favor of centralization closer to the point of decision and decentralization further out—for example, look at Coase’s The Nature of the Firm.  In his work, Coase works to explain why firms exist, and his argument is that it is more efficient for firms to internalize some transaction costs.  His work also tries to explain when firms tend to be hierarchical in nature (closer to central planning) versus market-oriented or commons-oriented—though a careful reading of Coase shows that even within a firm, spontaneous orders can emerge and can be efficient…but now I’m going down a completely different track.

Let’s Get Back To Computers

Okay, so we have three working theories of how to manage resources in a commons scenario.  Let’s shift our commons scenario around just a little bit:  now our common resource is a database server.  We have a certain number of CPU cores, a certain amount of RAM, a certain amount of disk, a certain disk throughput, and a certain network bandwidth.  Instead of shepherds grazing, we have product teams using the database.

Each product team makes use of the database server differently, where product team A makes x calls and uses f(x) resources, product team B makes y calls and uses f(y) resources, and product team C makes z calls and uses f(z) resources, where each of f(x), f(y), and f(z) lies between 0 and 1.  We hope that f(x) + f(y) + f(z) < 1; if that’s the case, we don’t have a resource allocation problem:  nobody has saturated the database server and all is well.  But realistically, most of us end up with resource contention, where every team pushes more and more product out and eventually team desires overwhelm server resources.  In this case, the equivalent of overgrazing is resource saturation, which leads to performance degradation across the board:  when CPU is at 100% or the disk queue keeps growing and growing, everybody suffers, not just the marginal over-user.

As a quick note, “1” in this case is a composite measure of multiple resources:  CPU, memory, disk capacity, disk throughput, and network throughput.  More realistically, f(x) describes a vector, each element of which corresponds to a specific resource and each element lies between 0 and 1.  For the sake of simplicity, I’m just going to refer to this as one number, but it’s certainly possible for a team to kill a server’s CPU while hardly touching disk (ask me how I know).

Thinking About Implications

In my experience, there are two factors which have led to servers becoming increasingly like a commons:  the Operations team’s loss of power and Agile product teams.

Gone are the days when an Operations team had a serious chance of spiking a project due to limited resources.  Instead, they’re expected to make do with what they have and find ways to say yes.  Maybe it’s always been like this—any curmudgeonly sysadmins want to chime in?—but it definitely seems to be the case now.

On the other side, the rise of Agile product teams has turned server resources into a commons problem for two reasons.  First, far too many people interpret “Agile” as “I don’t need to plan ahead.”  So they don’t plan ahead, instead just assuming that resource planning is some combination of premature optimization and something to worry about when we know we have a problem.

Second, Agile teams tend to have their own management structure:  teams run on their own with their own product owners, their own tech leads, and their own hierarchies and practices.  Those tech leads report to managers in IT/Engineering, sure, but they tend to be fairly independent.  There often isn’t a layer above the Agile team circumscribing their resource utilization, and the incentives are stacked for teams to behave with the servers like our example shepherds did with the common grazing field:  they focus on their own needs, often to the exclusion of the other shepherds and often to everybody’s detriment.

Thinking About Solutions

Given the server resource problem above, how would each of the three solutions we’ve covered apply to this problem?  Let’s look at them one by one, but in a slightly different order.

Central Planning

This is the easiest answer to understand.  Here, you have an architect or top-level manager or somebody with authority to determine who gets what share of the resources.  When there’s resource contention, the central planner says what’s less important and what needs to run.  The central planner often has the authority to tell product teams what to do, including having different teams spend sprint time to make their applications more efficient.  This is the first solution people tend to think about, and for good reason:  it’s what we’re used to, and it works reasonably well in many cases.

The problem with this solution (again, excluding public choice issues, incomplete information issues, and a litany of other, similar issues central planning has) is that typically, it’s management by crisis.  Lower-level teams act until they end up causing an issue, at which point the central planner steps in to determine who has a better claim on the resources.  Sometimes these people proactively monitor and allocate resources, but typically, they have other, more important things to do and if the server’s not on fire, well, something else is, so this can wait.

Commons Rules

In the “rules of the commons” solution, the different product teams work together with Operations to come up with rules of the game.  The different teams can figure out what to do when there is resource over-subscription, and there are built-in mechanisms to reduce over-utilization.  For example, Operations may put different teams into different resource groups, throttling the biggest users during a high-utilization moment.  Operations can also publish resource utilization by team, showing which teams are causing problems.

Let’s give two concrete examples of this from work.  First, each team has a separate application name that it uses in SQL Server connection strings.  That way, our Database Operations group can use Resource Governor to throttle CPU, memory, and I/O utilization by team.  This reduces some of the contention in some circumstances, but it doesn’t solve all of the issues:  assuming that there is no cheating on the part of teams, there are still parts of the application we don’t want to throttle—particularly, those parts of the application which drive the user interface.  It’s okay to throttle back-end batch operations, but if we’re slowing down the user experience, we are in for trouble.
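
For reference, a stripped-down sketch of that kind of Resource Governor setup follows.  The pool name, group name, limits, and application name are all hypothetical, and the real configuration is more involved; note that the classifier function needs to live in the master database.

--Create a pool and workload group for one team (names and limits are made up).
CREATE RESOURCE POOL InventoryTeamPool
	WITH (MAX_CPU_PERCENT = 40, MAX_MEMORY_PERCENT = 40);

CREATE WORKLOAD GROUP InventoryTeamGroup
	USING InventoryTeamPool;
GO

--Classify sessions by the Application Name each team sets in its connection string.
CREATE FUNCTION dbo.TeamClassifier()
RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
	DECLARE @WorkloadGroup SYSNAME = N'default';

	IF APP_NAME() = N'InventoryTeam'
		SET @WorkloadGroup = N'InventoryTeamGroup';

	RETURN @WorkloadGroup;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.TeamClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO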

Another example of how these rules can work is in Splunk logging.  We have a limited amount of logging we can do to Splunk per day, based on our license.  Every day, there is a report which shows how much of our Splunk logging we’ve used up in the prior day, broken down by team.  That way, if we go over the limit, the teams with the biggest logs know who they are and can figure out what’s going on.

The downside to commons rules is similar to that of the central planner:  teams don’t think about resource utilization until it’s too late and we’re struggling under load.  Sure, Operations can name and shame, but the incentives are stacked in such a way that nobody cares until there’s a crisis.  If you’re well under your logging limit, why work on minimizing logging when there are so many other things on the backlog?

Next Time:  The Price Signal

Despite the downsides I listed, the two systems above are not failures when it comes to IT/Engineering organizations.  They are both viable options, and I’m not saying to ditch what you currently do.  In the next post, I will cover the price option and spend an inordinate number of words writing about marketplaces in computing.

Temporal Tables As SCD2 Tables

In this post, I am going to look at using SQL Server 2016 temporal tables as slowly changing dimension tables.  We will walk through the setup and then look at what I consider to be the biggest disappointment with temporal tables; I will then show a workaround for this issue.

Act One:  The Setup

Temporal tables were introduced in SQL Server 2016.  They take a lot of the busy work out of creating history tables, handling things like data movement and building interfaces to query historical data for you.  This makes them very interesting from a T-SQL developer’s perspective, as it potentially allows us to keep our primary table smaller while still retaining historical data for user query.

In my example, I am going to build a table to predict quantity sold for a series of products.  In the real version of these tables, I have some fancy R code generating product models, and I store them in a table called ProductModel.  I want to have the latest version of the model readily available to me, but I would also like to store historical models for a product; that way, if I see a persistent problem with the model, I can load an older version of the model from SQL Server back into R and analyze the model further.

For this post, I’m going to do something a bit simpler; I’ll create a ProductModel table with a product ID and a model number.  This model number represents my R model and allows me to simplify the problem considerably by skipping the whole model creation part.

Here’s my ProductModel table:

CREATE TABLE [dbo].[ProductModel]
(
	ProductID INT NOT NULL,
	ModelNumber INT NOT NULL,
	StartDateGMT DATETIME2(3) GENERATED ALWAYS AS ROW START CONSTRAINT [DF_ProductModel_StartDateGMT] DEFAULT (SYSUTCDATETIME()),
	EndDateGMT DATETIME2(3) GENERATED ALWAYS AS ROW END CONSTRAINT [DF_ProductModel_EndDateGMT] DEFAULT(SYSUTCDATETIME()),
	PERIOD FOR SYSTEM_TIME (StartDateGMT, EndDateGMT),
	CONSTRAINT [PK_ProductModel] PRIMARY KEY CLUSTERED
	(
		ProductID
	)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE)
) WITH(SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductModelHistory));
GO

Note that I am creating a history table called dbo.ProductModelHistory, and the valid date range for each model will be from StartDateGMT until EndDateGMT.

Next up, I want to create a table to store my predictions.  In this table, I store the product ID, the date when I made the prediction, and how much I predicted we would sell of the product.  Note that this table does not have a model number:  the people who care about the predictions don’t have any idea what an R model is and don’t care how we store that model.

CREATE TABLE [dbo].[QuantitySoldPrediction]
(
	ProductID INT NOT NULL,
	DatePredictionMade DATE NOT NULL,
	PredictedQuantitySold INT NOT NULL,
	CONSTRAINT [PK_QuantitySoldPrediction] PRIMARY KEY CLUSTERED
	(
		ProductID,
		DatePredictionMade
	)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE)
);
GO

Now I want to populate these tables. Because I want to load historical data, I need to set system versioning off temporarily.

--Insert some historical data
ALTER TABLE dbo.ProductModel SET (SYSTEM_VERSIONING = OFF);

INSERT INTO dbo.ProductModelHistory
(
	ProductID,
	ModelNumber,
	StartDateGMT,
	EndDateGMT
)
VALUES
(1, 1, '2017-06-01', '2017-06-14'),
(1, 2, '2017-06-14', '2017-06-28'),
(1, 3, '2017-06-28', '2017-07-03 14:29:21'),
(2, 1, '2017-07-01', '2017-07-04 08:00:00');

ALTER TABLE dbo.ProductModel SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductModelHistory));

INSERT INTO dbo.ProductModel
(
	ProductID,
	ModelNumber
)
VALUES
(1, 4),
(2, 2),
(3, 1);

--Insert predictions, some for the historical data and some for now.
INSERT INTO dbo.QuantitySoldPrediction
(
	ProductID,
	DatePredictionMade,
	PredictedQuantitySold
)
VALUES
(1, '2017-06-02', 8),
(1, '2017-06-15', 11),
(1, '2017-06-19', 9),
(1, '2017-06-29', 21),
(1, '2017-07-02', 23),
(2, '2017-07-02', 2),
(2, '2017-07-03', 1),
(1, GETUTCDATE(), 44),
(2, GETUTCDATE(), 6),
(3, GETUTCDATE(), 16);

At this point, we have the following:

  • Three products in our ProductModel table
  • Four historical ProductModel records.  Note that there is a gap between historical end dates and the start dates in the ProductModel table; that’s okay for what we’re doing.
  • Ten predictions, including seven historical predictions and three current predictions.
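
If you would like to double-check that state, a quick look at the current table and the history table will confirm it:

--Current rows (one per product) and the historical rows we back-filled.
SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel pm;

SELECT
	pmh.ProductID,
	pmh.ModelNumber,
	pmh.StartDateGMT,
	pmh.EndDateGMT
FROM dbo.ProductModelHistory pmh;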

Act Two:  The Falling Action

Now that I have my data, I want to do something simple:  I want to show the model number associated with each prediction.  This involves looking at the ProductModel and ProductModelHistory tables based on StartDateGMT.

The documentation for temporal tables shows how you can use FOR SYSTEM_TIME AS OF to look at time frames.  For example:

SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel FOR SYSTEM_TIME AS OF '2017-07-02 08:00:00' pm;

This brings us back the following result:

[Image:  query results showing the rows for products 1 and 2 that were in effect as of 2017-07-02 08:00:00]

We can also use a variable, allowing us to look at what the table looked like at a user-defined point in time:

DECLARE
	@InterestingTime DATETIME = '2017-07-02 08:00:00';

SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel FOR SYSTEM_TIME AS OF @InterestingTime pm;

This gives us back the same result. Thinking about this syntax, what I want to do is join to the dbo.QuantitySoldPrediction table. Let’s start simple:

DECLARE
	@InterestingTime DATETIME = '2017-07-02 08:00:00';

SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel FOR SYSTEM_TIME AS OF @InterestingTime pm
	INNER JOIN dbo.QuantitySoldPrediction qsp
		ON pm.ProductID = qsp.ProductID;

This query succeeds but returns results we don’t really want:

[Image:  query results showing nine rows for products 1 and 2, each with the same StartDateGMT and EndDateGMT]

This brings back all 9 records tied to products 1 and 2 (because product 3 didn’t exist on July 2nd at 8 AM UTC). But it gives us the same start and end date, so that’s not right. What I really want to do is replace @InterestingTime with qsp’s DatePredictionMade, so let’s try that:
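
For reference, the attempt was the previous query with the variable swapped out for the column, along these lines (a reconstruction, so treat the exact text as an approximation):

--Reconstructed attempt: swap the variable for qsp's column.  This does not run.
SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel FOR SYSTEM_TIME AS OF qsp.DatePredictionMade pm
	INNER JOIN dbo.QuantitySoldPrediction qsp
		ON pm.ProductID = qsp.ProductID;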

[Image:  the syntax error SQL Server returns for this query]

This returns a syntax error. It would appear that at the time FOR SYSTEM_TIME is resolved, QuantitySoldPrediction does not yet exist. This stops us dead in our tracks.

Act Three: The Denouement

We don’t have an easy way of solving the problem, but we do have a complicated way. Under the covers, dbo.ProductModelHistory is just a table. That means we can still query it like any other table. What we want is the latest single product model date that fits within each quantity sold prediction’s date. To solve this, we pull one of my favorite rabbits out of the hat: the APPLY operator.

SELECT
	q.ProductID,
	q.DatePredictionMade,
	q.PredictedQuantitySold,
	pm.ModelNumber
FROM dbo.QuantitySoldPrediction q
	CROSS APPLY
	(
		SELECT TOP(1)
			pmbuild.ModelNumber
		FROM
		(
			SELECT
				pm.ModelNumber,
				pm.StartDateGMT
			FROM dbo.ProductModel pm
			WHERE
				pm.ProductID = q.ProductID

			UNION ALL

			SELECT TOP(1)
				pmh.ModelNumber,
				pmh.StartDateGMT
			FROM dbo.ProductModelHistory pmh
			WHERE
				pmh.ProductID = q.ProductID
				AND pmh.StartDateGMT <= q.DatePredictionMade
			ORDER BY
				pmh.StartDateGMT DESC
		) pmbuild
		ORDER BY
			pmbuild.StartDateGMT ASC
	) pm;

This query is quite a bit longer than I’d like, as I need to build a sub-query to union together current and historical data, and then grab the top row for each quantity sold prediction. On the plus side, the results are what I want:

[Image:  query results showing each prediction alongside the model number in effect when the prediction was made]

This is one of two reasons that make me think that temporal tables do not make for great slowly changing dimension implementations.  The other reason is that you can’t insert into the history table as long as you have SYSTEM_VERSIONING on.  These tables can be quite useful for storing fact table history, as the FOR SYSTEM_TIME AS OF point-in-time snapshot makes a lot more sense in that case.  And as we saw above, if you’re willing to try hard enough when writing your retrieval queries, you can make temporal tables work.  In my case, I had one query like this I needed to write, so the benefits still outweighed the costs.

Bonus Evolutionary Algorithm

This is part of a series entitled Genetics In Action.

To wrap up this series on evolutionary algorithms, I want to provide you with one final algorithm, the Flopping Fish Algorithm. This is not a particularly good algorithm at all, but it is one that I’ve seen in practice.

Let’s start by taking a genetic algorithm. As you’ll recall from prior posts in the series, genetic algorithms have a pair of important modification rates: the crossover rate and the mutation rate. The crossover rate gives a genetic algorithm its power, whereas the mutation rate helps kick us out of local minima/maxima. The crossover rate should be fairly high (60-90% is a reasonable range) whereas mutation should be low (1% or so is a nice start, though in certain extreme cases, it might go as high as 20-30%).

In contrast to the genetic algorithm rates, let’s go the opposite way: have a mutation rate above 50% and a crossover rate of 10% or less. As an aside, I’m picking these numbers because those were actually numbers I once found in a production system.

So what does this high-mutation, low-crossover search pattern look like? Flopping fish. Let’s motivate this with a story.

Imagine that you have a two-dimensional board of some arbitrarily large size. Each spot on the board is shaded somewhere between entirely white and entirely red. Your goal is to land on the reddest part of the board.

With a genetic algorithm, you build a fitness function where the output would be the amount of red on the board. For every (x, y) combo, you get an R value somewhere in the range of [ 0, 255 ]. You start out with a random sample of squares on the board and splice x and y values, and your strongest organisms tend to move in a semi-orderly fashion toward those darker-shaded spots. By the end of the run, you hope to have landed on the darkest spot.

By contrast, the flopping fish algorithm has you start off with a bucket of fish. You pour them onto the board and watch them as they flop around. Their flopping is not entirely aimless, but there is little rhyme or reason to the motions. Eventually, the fish stop flopping and you collect the points at which they met their doom, choosing the darkest-hued spot as your solution.

The flopping fish algorithm is a waystation between a genetic algorithm (which you might consider controlled “randomness”) and random selection of points. It is a little different from pulling a bunch of random numbers from thin air—with the flopping fish, each fish can move only so far in its gasps—but the effect is about the same. A genetic algorithm with too large a mutation rate effectively ignores the signals that the fitness function provides—and remember, those signals are the only signals the algorithm gets in order to tell if it’s on the right track.

If you’re going to run a genetic algorithm with an extremely high mutation rate and extremely low crossover rate, save yourself the trouble and just pluck random numbers from thin air; that’ll save you time and effort.

Genetic Programming: DBA Salaries

This is part of a series entitled Genetics In Action.

Today’s post shows off another variant of genetic programming.  Last time around, we looked at mathematical functions using rgp.  Today, we are going to build conditional programs using evtree, another evolutionary library in R.

Time To Buy Me A DBA

The data set we will use is the survey of data professional salaries that Brent Ozar put together in 2016.  I’ve already looked at the data once, but now I want to apply a different approach:  I want to see how well I can explain and predict variance in DBA salaries.

Starting out, I’m going to load three packages:  evtree (which lets me build evolutionary trees), XLConnect (which lets me read Excel files), and tidyverse (which is the Swiss army knife of R packages):

if(!require(evtree)) {
    install.packages("evtree", repos = "http://cran.us.r-project.org")
    library(evtree)
}

if(!require(XLConnect)) {
  install.packages("XLConnect", repos = "http://cran.us.r-project.org")
  library(XLConnect)
}

if(!require(tidyverse)) {
  install.packages("tidyverse", repos = "http://cran.us.r-project.org")
  library(tidyverse)
}

Let’s pull in the salary data using XLConnect.

wb <- loadWorkbook("Data/2017_Data_Professional_Salary_Survey_Responses.xlsx")
salary_data <- readWorksheet(wb, sheet = "Salary Survey", region = "A4:T2902")

Next up, I would like to look at the structure of the salary data, which I can do easily by running str(salary_data).

We have a sample of approximately 3000 data professionals, as well as 18 variables per observation—I’m ignoring Timestamp and Counter, as they are not relevant to the discussion.

What To Do, What To Include

Last time, I looked at purchasing power parity to try to find countries where data professionals are relatively better off. This time around, I’m going to limit my search to data professionals in the United States. Even with the United States, there are differences in PPP—after all, a DBA living in San Francisco or New York City making $100K a year is quite a different story than a DBA in Peoria or Richmond making $100K a year—but there’s a smaller spread. It’s a spread that I cannot control, so I’m expecting that I won’t be able to explain a healthy chunk of the variance, but we proceed apace.

Next, I will focus only on those who selected Microsoft SQL Server as their primary database. I don’t want to deal with level differences between Oracle, DB2, MySQL, and SQL Server. Also, a big majority of respondents selected SQL Server, so there’s probably some bias in the non-SQL Server participants.

After that, I want to include only full-time employees, filtering out part-time, freelance, or contract employees. That’s not a value judgment, but it does constrain my sample a bit further.

Finally, I want to filter out anybody making at least $300K a year. There are only a couple of those in the data set and they skew the results pretty hard. If you’re making $300K+ a year as a full-time employee, great; but you’re so atypical a case that I don’t care about you. Alternatively, at least one of those results might be a typo, but we’re not sure.

Now that I’ve explained the filters, I want to discuss my interesting variables. I want to explain SalaryUSD using five variables: YearsWithThisDatabase, HoursWorkedPerWeek, Education, EmploymentSector, and ManageStaff. The other variables are probably helpful as well, but for my demo, I want to keep the number of variables in play relatively low.

Now that I’ve discussed what I plan to do, let’s actually do it in far fewer lines of R:

salary_pred <- salary_data %>%
  filter(Country == "United States") %>%
  filter(PrimaryDatabase == "Microsoft SQL Server") %>%
  filter(EmploymentStatus == "Full time employee") %>%
  filter(SalaryUSD <= 300000) %>%
  select(SalaryUSD, YearsWithThisDatabase, HoursWorkedPerWeek, Education, EmploymentSector, ManageStaff)

salary_pred$Education <- as.factor(salary_pred$Education)
salary_pred$EmploymentSector <- as.factor(salary_pred$EmploymentSector)
salary_pred$ManageStaff <- as.factor(salary_pred$ManageStaff)

I now have a data set called salary_pred which includes the interesting variables. After putting in all of my filters, I still have 1624 observations, so I’m satisfied that I should have enough data to find something interesting.

Something Interesting

I am now going to build a regression tree with evtree. A regression tree is a fairly simple idea. It’s a decision tree, except that we build a regression for each leaf node. The syntax for building a genetic program which gives us a regression tree is pretty simple:

ev <- evtree(SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + Education + EmploymentSector + ManageStaff,
             data = salary_pred, minbucket = 10, maxdepth = 4)

This churns for a little bit and then, when it’s done, we can call plot(ev) to get the following image:

This image is a little busy; I actually find it easier to read the results by running ev instead:

Model formula:
SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + Education + 
    EmploymentSector + ManageStaff

Fitted party:
[1] root
|   [2] YearsWithThisDatabase < 8
|   |   [3] YearsWithThisDatabase < 5: 73507.922 (n = 244, err = 117381058953.5)
|   |   [4] YearsWithThisDatabase >= 5: 90571.821 (n = 274, err = 126256913002.2)
|   [5] YearsWithThisDatabase >= 8
|   |   [6] EmploymentSector in Local government, State/province government, Student: 91592.034 (n = 117, err = 53016079035.9)
|   |   [7] EmploymentSector in Federal government, Private business
|   |   |   [8] ManageStaff in No
|   |   |   |   [9] YearsWithThisDatabase < 16: 105536.492 (n = 453, err = 253588315377.2)
|   |   |   |   [10] YearsWithThisDatabase >= 16: 117556.179 (n = 285, err = 172506498805.9)
|   |   |   [11] ManageStaff in Yes: 123739.729 (n = 251, err = 200634714045.6)

Number of inner nodes:    5
Number of terminal nodes: 6

Let’s sum this up in a list, where each selection is exclusive of the others:

  1. If you have spent less than 5 years as a SQL Server professional, we expect your salary to be around $73,507.
  2. If you have spent between 5 and 8 years as a SQL Server professional, we expect your salary to be around $90,571.
  3. If you have spent at least eight years as a SQL Server professional and you are in local/state government or you are a student, we expect your salary to be around $91,592.
  4. If you work in the federal government or private business, are not a manager, and have spent 8-16 years as a SQL Server professional, we expect your salary to be around $105,536.
  5. If you work in the federal government or private business, are not a manager, and have spent at least 16 years as a SQL Server professional, we expect your salary to be around $117,556.
  6. If you work in the federal government or private business, are a manager, and have spent at least 8 years as a SQL Server professional, we expect your salary to be around $123,739.

On the plus side, each of these nodes has at least 100 members, and most have 250+, so it’s a fairly broad distribution in each. On the minus side, the error terms seem rather large:  err here is the sum of squared errors within the node, so dividing by n and taking the square root gives a within-node RMSE north of $20,000 for every leaf.

Thinking Linearly

To give a contrast to our regression tree, let’s try a simple linear regression.

salary_lm <- lm(SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + Education + EmploymentSector + ManageStaff,
                data = salary_pred)

There’s nothing fancy to this regression, and when I call summary(salary_lm), I get the following:

Call:
lm(formula = SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + 
    Education + EmploymentSector + ManageStaff, data = salary_pred)

Residuals:
    Min      1Q  Median      3Q     Max 
-110294  -15585   -1307   12911  144645 

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                67621.12    6151.45  10.993  < 2e-16 ***
YearsWithThisDatabase                       2262.35     100.98  22.403  < 2e-16 ***
HoursWorkedPerWeek                           155.67      93.77   1.660 0.097090 .  
EducationBachelors (4 years)                7288.25    1958.67   3.721 0.000205 ***
EducationDoctorate/PhD                     17428.29    8201.28   2.125 0.033733 *  
EducationMasters                           12071.38    2279.88   5.295 1.36e-07 ***
EducationNone (no degree completed)         4931.55    2375.89   2.076 0.038083 *  
EmploymentSectorLocal government          -21701.26    5074.87  -4.276 2.01e-05 ***
EmploymentSectorPrivate business           -5810.89    4357.09  -1.334 0.182502    
EmploymentSectorState/province government -23477.37    4985.01  -4.710 2.70e-06 ***
EmploymentSectorStudent                   -29033.34   10683.44  -2.718 0.006646 ** 
ManageStaffYes                             11518.96    1452.14   7.932 3.99e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23930 on 1612 degrees of freedom
Multiple R-squared:  0.3228,	Adjusted R-squared:  0.3182 
F-statistic: 69.86 on 11 and 1612 DF,  p-value: < 2.2e-16

What we see is that the linear model explains approximately 32% of the variance in salaries. As I mentioned above, I believe that relative purchasing power plays a significant role in explaining variance, but because we don’t have the data at that grain, I cannot test this conjecture.

That aside, we can see some interesting contrasts. Note that the linear model picked up local/state government and student status as major negatives, knocking off at least $21K apiece and treating each as statistically significant. The linear model, meanwhile, does not consider hours worked per week significant, either statistically or in terms of magnitude:  each extra hour of work per week is associated with only about $156 more in salary per year. I’d hazard a guess that this is because most full-time data professionals are salaried employees.

Let’s Compare

Now that we have a genetic model and a linear model, we want to compare the two and see which turns out better.

First, let’s use our models to generate predictions over the data set. Then I want to calculate the root mean square deviation (RMSD) for each of the two.

salary_pred$EVPrediction <- predict(ev)
salary_pred$LMPrediction <- predict(salary_lm)

sqrt(mean((salary_pred$SalaryUSD - salary_pred$EVPrediction)^2))
sqrt(mean((salary_pred$SalaryUSD - salary_pred$LMPrediction)^2))

The evolutionary regression tree has an RMSD of $23,845 whereas the linear regression has an RMSD of $23,841. In other words, they’re practically the same.

But let’s not allow aggregates to conceal the important mechanisms of each model. Let’s see how each predicts for different people.

The Lives Of Others (But Mostly Me)

In this test, I want to build up three test cases. The first test is a new graduate, right out of school with a 4-year Bachelor’s. The second test is me, after having spent 5 years working for the state. The final test is me today.

test_cases <- data.frame(
    YearsWithThisDatabase = c(0, 5, 11),
    HoursWorkedPerWeek = c(40, 40, 40),
    Education = c("Bachelors (4 years)", "Masters", "Masters"),
    EmploymentSector = c("Private business", "State/province government", "Private business"),
    ManageStaff = c("No", "No", "Yes")
)
  YearsWithThisDatabase HoursWorkedPerWeek           Education          EmploymentSector ManageStaff
1                     0                 40 Bachelors (4 years)          Private business          No
2                     5                 40             Masters State/province government          No
3                    11                 40             Masters          Private business         Yes

We can see how each of the models would predict the people in this sample.

predict(ev, test_cases)
predict(salary_lm, test_cases)

In the first case, for the new person, both models are in harmony: the evolutionary regression tree came up with a value of $73,507 and the linear regression was $75,325.

In the second case, the evolutionary regression tree came up with $90,571 and the linear regression was $73,753. That’s a huge difference between the two!

In the third case, we see another fairly big difference: $123,739 for the evolutionary model versus $116,513 for the linear model.

Even though the two models have practically the same top-line error, it’s clear which model I prefer!

Conclusion

This wraps up genetic programming.  There’s a lot more to the topic—after all, John Koza wrote numerous books on it!—but hopefully this spurs some thought on where a genetic programming algorithm might be useful.

Genetic Programming: Mathematical Functions

This is part of a series entitled Genetics In Action.

In today’s post, I am going to look at two examples of deriving a mathematical function.  I will use the rgp package in R.  It appears that nobody is maintaining the rgp package anymore, so this isn’t something I’d use in a production scenario, but it does offer us a fairly easy introduction.

Tell Me What I Know

In this first example, I will look at a case where we hope to see the genetic program approximate a known function as closely as possible. You can think of this as a supervised test, where we have a desired end state and want to see how well the genetic program’s results fare in comparison.

A Craftsman Never Blames His Tools

if(!require(rgp)) {
    install.packages("rgp", repos = "http://cran.us.r-project.org")
    library(rgp)
}

function_set <- functionSet("+", "*", "-")
input_variable_set <- inputVariableSet("x")
constant_factory_set <- constantFactorySet(function() { rnorm(1) })

In this first bit of code, we load the rgp package and set up some of our constraints. I want to derive a mathematical formula using only the addition, multiplication, and subtraction operators. As a note, division can be troublesome in a genetic program because of the chance of dividing by zero.

Aside from mathematical operators, we have one independent variable in the mix: x. Now we need to figure out our fitness function.

function_interval <- seq(from = -pi, to = pi, by = 0.1)
fitness_function <- function(f) { rmse(f(function_interval), sin(function_interval)) }

Our end goal is a sine wave that runs from -pi to pi. Our fitness function is the Root Mean Square Error of our generated function versus the ideal of the sine wave, meaning that we want to minimize the result of our fitness function, with a value of 0 matching the sine wave exactly. Now that we have everything in place, let’s kick off the process.

result <- geneticProgramming(
  functionSet = function_set,
  inputVariables = input_variable_set,
  constantSet = constant_factory_set,
  fitnessFunction = fitness_function,
  stopCondition = makeTimeStopCondition(20) #run for 20 seconds
)

This will run for 20 seconds, during which time you will hopefully find functions which converge closer and closer to the ideal. In my case, it looked like this:

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 0.706164, time elapsed: 0.84 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 0.445428, time elapsed: 1.4 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 0.445256, time elapsed: 2.42 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 0.445256, time elapsed: 3.53 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 0.445256, time elapsed: 4.32 seconds
evolution step 600, fitness evaluations: 29950, best fitness: 0.445256, time elapsed: 5.26 seconds
evolution step 700, fitness evaluations: 34950, best fitness: 0.445256, time elapsed: 6.15 seconds
evolution step 800, fitness evaluations: 39950, best fitness: 0.445256, time elapsed: 6.75 seconds
evolution step 900, fitness evaluations: 44950, best fitness: 0.445210, time elapsed: 7.59 seconds
evolution step 1000, fitness evaluations: 49950, best fitness: 0.445210, time elapsed: 8.29 seconds
evolution step 1100, fitness evaluations: 54950, best fitness: 0.445210, time elapsed: 8.9 seconds
evolution step 1200, fitness evaluations: 59950, best fitness: 0.445210, time elapsed: 9.81 seconds
evolution step 1300, fitness evaluations: 64950, best fitness: 0.445210, time elapsed: 12.35 seconds
evolution step 1400, fitness evaluations: 69950, best fitness: 0.445210, time elapsed: 13.09 seconds
evolution step 1500, fitness evaluations: 74950, best fitness: 0.445210, time elapsed: 14.81 seconds
evolution step 1600, fitness evaluations: 79950, best fitness: 0.445198, time elapsed: 15.48 seconds
evolution step 1700, fitness evaluations: 84950, best fitness: 0.445198, time elapsed: 16.17 seconds
evolution step 1800, fitness evaluations: 89950, best fitness: 0.445198, time elapsed: 17.29 seconds
evolution step 1900, fitness evaluations: 94950, best fitness: 0.445198, time elapsed: 18.01 seconds
evolution step 2000, fitness evaluations: 99950, best fitness: 0.445198, time elapsed: 18.68 seconds
evolution step 2100, fitness evaluations: 104950, best fitness: 0.445198, time elapsed: 19.31 seconds
Genetic programming evolution run FINISHED after 2138 evolution steps, 106850 fitness evaluations and 20 seconds.

There was some convergence, but it wasn’t the best. Let’s take a look at the best program:

best_solution <- result$population[[which.min(result$fitnessValues)]] # the individual with the lowest (best) fitness
best_solution             # print the winning function
min(result$fitnessValues) # and its fitness value

That spits back the following:

function (x) 
0.305371214275836 * (x - -0.0445197474230304)
0.445198186160344

The reduced form function is 0.305x + 0.013595. That’s a linear function with respect to x, and the RMSE is 0.445, which is not great.
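
As a double-check, you can evaluate the fitness function against the winning program directly and confirm that it matches the reported value:

fitness_function(best_solution)  # ~0.4452, the same as min(result$fitnessValues)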

Here’s a plot of our solution versus the ideal solution:

plot(y = best_solution(function_interval), x = function_interval, type = "l", lty = 1, xlab = "x", ylab = "y")
lines(y = sin(function_interval), x = function_interval, lty = 2)

Genetic program with a one-track mind.

A Craftsman Goes Out And Gets Better Tools

That first result wasn’t very good, and part of the problem is that we did not give the genetic program the right tools to figure out a solution to our problem. So let’s change the function set just a little bit:

function_set <- functionSet("+", "*", "-", "cos")

You and I know that cosine and sine have a direct relationship: shift sine by pi/2 and you get cosine. Let’s see if the genetic program figures out the same. This time around, I ran the process for just 7 seconds.
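
The call is the same as before aside from the expanded function set and the shorter stop condition; here is a sketch of the re-run, assuming the other setup objects from above are still in scope.

result <- geneticProgramming(
  functionSet = function_set,           # now includes cos
  inputVariables = input_variable_set,
  constantSet = constant_factory_set,
  fitnessFunction = fitness_function,
  stopCondition = makeTimeStopCondition(7) # run for 7 seconds this time
)

We can see the improvement: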

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 0.458847, time elapsed: 0.64 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 0.445672, time elapsed: 1.32 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 0.124575, time elapsed: 2 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 0.003608, time elapsed: 2.64 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 0.003608, time elapsed: 4.41 seconds
evolution step 600, fitness evaluations: 29950, best fitness: 0.003608, time elapsed: 5.22 seconds
evolution step 700, fitness evaluations: 34950, best fitness: 0.003608, time elapsed: 5.97 seconds
evolution step 800, fitness evaluations: 39950, best fitness: 0.003608, time elapsed: 6.71 seconds
Genetic programming evolution run FINISHED after 840 evolution steps, 41950 fitness evaluations and 7 seconds.

For the first couple seconds, the algorithm wasn’t doing much better than it had last time around, but then something clicked and the best solution got a lot better. Let’s take a look at it.

best_solution <- result$population[[which.min(result$fitnessValues)]]
best_solution
min(result$fitnessValues)

This returns:

function (x) 
cos(x - 0.291574502599581 - 1.2741260116152)
0.00360807823117699

In other words, this is cos(x - 1.5657), and 1.5657 is just shy of pi/2 (approximately 1.5708). If we plot the curve, we’ll see an almost-exact match.

plot(y = best_solution(function_interval), x = function_interval, type = "l", lty = 1, xlab = "x", ylab = "y")
lines(y = sin(function_interval), x = function_interval, lty = 2)

Turns out cosine and sine are actually related.

Finding The Best Tools

We can take this one step further and swap in rgp’s built-in trigonometric function set like so:

function_set <- trigonometricFunctionSet

In rgp, this includes the set { sin, cos, tan }. When we run the process with this set, it finds the best solution almost immediately: sin(x).
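
As an aside, if you prefer to spell the set out explicitly (or want to mix the arithmetic operators back in), you can build the equivalent set by hand with the same functionSet constructor we used earlier:

# Equivalent to rgp's trigonometricFunctionSet; add "+", "-", "*" to the list
# if you also want the arithmetic operators in play.
function_set <- functionSet("sin", "cos", "tan")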

Tell Me What I Don’t Know

Now we’re going to shake things up a bit. Instead of trying to match a function, we will take a sample of data collected in the wild and attempt to fit a function to it.

if(!require(rgp)) {
    install.packages("rgp", repos = "http://cran.us.r-project.org")
    library(rgp)
}

if(!require(igraph)) {
  install.packages("igraph", repos = "http://cran.us.r-project.org")
  library(igraph)
}

We will still use rgp, but I’m also going to introduce igraph, which allows us to visualize graphs easily.

Unlike before, we don’t have a function in mind; rather, we have some x-y coordinates.

input_data <- data.frame(
    x = c(1, 2, 3, 4, 5, 6, 7),
    y = c(0, 26, 88, 186, 320, 490, 541)
)

If we plot this with plot(input_data$x, input_data$y), we get the following image:

Did you know that disco record sales were up 400% for the year ending 1976?  If these trends continue…AY!

Let’s turn this set of numbers into a plausible example. Let’s say we have a rocket car we’re testing, where the y axis represents speed in miles per hour and the x axis represents seconds elapsed. What we see is rapid, accelerating growth for the first few seconds, but then we hit an inflection point and the gains start tailing off near the end. If we drew this out for a few more seconds, we’d probably see the y value flatline somewhere under 600.

Now that we have our just-so story in place, let’s build our process. We will use the basic arithmetic operators and perform a symbolic regression of y against x, running for 500 steps.

function_set <- functionSet("+", "-", "*")
result <- symbolicRegression(
    y ~ x,
    data = input_data,
    functionSet = function_set,
    stopCondition = makeStepsStopCondition(500)
)

We could run for a specific period of time as well; rgp has several stop conditions available, and a time-based version of this run appears in the sketch below.
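
This is the same call with only the stop condition swapped out (makeTimeStopCondition is the helper we used in the first example); I did not run this variant, so treat it as illustrative.

result <- symbolicRegression(
    y ~ x,
    data = input_data,
    functionSet = function_set,
    stopCondition = makeTimeStopCondition(30) # stop after 30 seconds instead of 500 steps
)

Anyhow, here’s how the results look in my particular run with the 500-step stop condition: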

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 33.007936, time elapsed: 3.75 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 33.007936, time elapsed: 7.18 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 33.007936, time elapsed: 10.36 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 33.007936, time elapsed: 13.53 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 33.007936, time elapsed: 16.9 seconds
Genetic programming evolution run FINISHED after 500 evolution steps, 24950 fitness evaluations and 16.9 seconds.

Looks like we hit a pretty decent result early but couldn’t improve on it in the allocated number of operations. Here’s what our result looks like:

plot(input_data$y, col=1, type="l")                            # observed values
points(predict(result, newdata = input_data), col=2, type="l") # model predictions overlaid

No, don’t worry, we can forecast outside of the prediction range no problem!

And here’s the formula that our best result generates:

best_result <- result$population[[which.min(result$fitnessValues)]]
best_result

That returns:

8.87532839551568 * (-1.28207603423574 * -1.08413319016388) * x * x

Simplifying that, we have approximately 12.336x^2. That’s a pretty good fit for the data set, though it does not capture the falloff near the end and would perform poorly in our out-of-sample data (assuming, again, that subsequent results continue the late trend of trailing off toward a maximum value under 600).
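
To make that concrete, here is a quick comparison of the fitted values from the simplified formula against the observed data; the overshoot at x = 7 is exactly where the real data starts to flatten out.

# Fitted values from ~12.336 * x^2 alongside the observed data points.
data.frame(
  x = input_data$x,
  observed = input_data$y,
  fitted = round(12.336 * input_data$x^2, 1)
)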

1000 Words Coming Up

I’d like to display the result as a graph, because now that I’m in management, I find myself incapable of understanding anything that does not have a picture associated with it. The rgp package has a nice function called funcToIgraph() built in, which allows you to turn best_result into a graph that igraph can read and display. There’s a bug in rgp, though: when rgp was built, igraph counted vertices from zero, but igraph has since moved to 1-based indexing, leaving funcToIgraph() with an off-by-one bug. To get around this, I wrote my own function to use in this demo:

exprToIgraph <- function(expr) {
  if (!require("igraph")) stop("exprToIgraph: Package 'igraph' not installed.")
  exprGraph <- exprToGraph(expr)  # rgp's conversion of an R expression into edge and vertex lists
  exprIgraph <- graph(exprGraph$edges, n = length(exprGraph$vertices)) # igraph vertexes used to be counted from zero but no longer are!
  V(exprIgraph)$label <- exprGraph$vertices  # label each vertex with its operator or operand
  exprIgraph
}

I’d happily submit it as a pull request, but I could only find mirrors and forks of rgp and not any active repository.

Anyhow, let’s satisfy the manager inside and turn this formula into a picture:

final_graph <- exprToIgraph(body(best_result))
plot(final_graph, layout=layout_as_tree, vertex.color="orange", vertex.size=30, edge.arrow.size=0.1)

Somebody out there is mentally picturing this graph in Reverse Polish Notation.

There we go; now I’m rejuvenated and ready to demand some more TPS reports.

Conclusion

Today’s post was a look at a pair of genetic programs which derive mathematical functions. This is one of the main uses of genetic programming, but it’s not the only one. In tomorrow’s post, I will look at another major use case.