Genetic Programming: Mathematical Functions

This is part of a series entitled Genetics In Action.

In today’s post, I am going to look at two examples of deriving a mathematical function.  I will use the rgp package in R.  It appears that nobody is maintaining the rgp package anymore, so this isn’t something I’d use in a production scenario, but it does offer us a fairly easy introduction.

Tell Me What I Know

In this first example, I will look at a case where we hope to see the genetic program approximate a known function as closely as possible. You can think of this as a supervised test, where we have a desired end state and want to see how well the genetic program’s results fare in comparison.

A Craftsman Never Blames His Tools

if(!require(rgp)) {
    install.packages("rgp", repos = "http://cran.us.r-project.org")
    library(rgp)
}

function_set <- functionSet("+", "*", "-")
input_variable_set <- inputVariableSet("x")
constant_factory_set <- constantFactorySet(function() { rnorm(1) })

In this first bit of code, we load the rgp package and set up some of our constraints. I want to derive a mathematical formula using only the addition, multiplication, and subtraction operators. As a note, division can be troublesome in a genetic program because of the chance of dividing by zero.
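
If you do want division in the mix, the standard trick is a “protected” division operator that returns a safe value instead of dividing by zero.  Here is a sketch; I am assuming rgp will resolve a custom function supplied by name, so treat it as illustrative rather than battle-tested:

# Hypothetical protected division: return 1 whenever the denominator is near zero
safeDivide <- function(a, b) { ifelse(abs(b) < 1e-6, 1, a / b) }
function_set_with_division <- functionSet("+", "*", "-", "safeDivide")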

Aside from mathematical operators, we have one independent variable in the mix: x. Now we need to figure out our fitness function.

function_interval <- seq(from = -pi, to = pi, by = 0.1)
fitness_function <- function(f) { rmse(f(function_interval), sin(function_interval)) }

Our end goal is a sine wave that runs from -pi to pi. Our fitness function is the Root Mean Square Error of our generated function versus the ideal of the sine wave, meaning that we want to minimize the result of our fitness function, with a value of 0 matching the sine wave exactly. Now that we have everything in place, let’s kick off the process.

result <- geneticProgramming(
  functionSet = function_set,
  inputVariables = input_variable_set,
  constantSet = constant_factory_set,
  fitnessFunction = fitness_function,
  stopCondition = makeTimeStopCondition(20) #run for 20 seconds
)

This will run for 20 seconds, during which time you will hopefully find functions which converge closer and closer to the ideal. In my case, it looked like this:

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 0.706164, time elapsed: 0.84 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 0.445428, time elapsed: 1.4 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 0.445256, time elapsed: 2.42 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 0.445256, time elapsed: 3.53 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 0.445256, time elapsed: 4.32 seconds
evolution step 600, fitness evaluations: 29950, best fitness: 0.445256, time elapsed: 5.26 seconds
evolution step 700, fitness evaluations: 34950, best fitness: 0.445256, time elapsed: 6.15 seconds
evolution step 800, fitness evaluations: 39950, best fitness: 0.445256, time elapsed: 6.75 seconds
evolution step 900, fitness evaluations: 44950, best fitness: 0.445210, time elapsed: 7.59 seconds
evolution step 1000, fitness evaluations: 49950, best fitness: 0.445210, time elapsed: 8.29 seconds
evolution step 1100, fitness evaluations: 54950, best fitness: 0.445210, time elapsed: 8.9 seconds
evolution step 1200, fitness evaluations: 59950, best fitness: 0.445210, time elapsed: 9.81 seconds
evolution step 1300, fitness evaluations: 64950, best fitness: 0.445210, time elapsed: 12.35 seconds
evolution step 1400, fitness evaluations: 69950, best fitness: 0.445210, time elapsed: 13.09 seconds
evolution step 1500, fitness evaluations: 74950, best fitness: 0.445210, time elapsed: 14.81 seconds
evolution step 1600, fitness evaluations: 79950, best fitness: 0.445198, time elapsed: 15.48 seconds
evolution step 1700, fitness evaluations: 84950, best fitness: 0.445198, time elapsed: 16.17 seconds
evolution step 1800, fitness evaluations: 89950, best fitness: 0.445198, time elapsed: 17.29 seconds
evolution step 1900, fitness evaluations: 94950, best fitness: 0.445198, time elapsed: 18.01 seconds
evolution step 2000, fitness evaluations: 99950, best fitness: 0.445198, time elapsed: 18.68 seconds
evolution step 2100, fitness evaluations: 104950, best fitness: 0.445198, time elapsed: 19.31 seconds
Genetic programming evolution run FINISHED after 2138 evolution steps, 106850 fitness evaluations and 20 seconds.

There was some convergence, but it wasn’t the best. Let’s take a look at the best program:

best_solution <- result$population[[which.min(result$fitnessValues)]]
best_solution
min(result$fitnessValues)

That spits back the following:

function (x) 
0.305371214275836 * (x - -0.0445197474230304)
0.445198186160344

The reduced form function is 0.305x + 0.013595. That’s a linear function with respect to x, and the RMSE is 0.445, which is not great.

Here’s a plot of our solution versus the ideal solution:

plot(y = best_solution(function_interval), x = function_interval, type = "l", lty = 1, xlab = "x", ylab = "y")
lines(y = sin(function_interval), x = function_interval, lty = 2)

Genetic program with a one-track mind.

A Craftsman Goes Out And Gets Better Tools

That first result wasn’t very good, and part of the problem is that we did not give the genetic program the right tools to figure out a solution to our problem. So let’s change the function set just a little bit:

function_set <- functionSet("+", "*", "-", "cos")

You and I know that cosine and sine have a direct relationship: shift sine by pi/2 and you get cosine. Let’s see if the genetic program figures out the same. This time around, I ran the function for 7 seconds, and we can see the improvement:

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 0.458847, time elapsed: 0.64 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 0.445672, time elapsed: 1.32 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 0.124575, time elapsed: 2 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 0.003608, time elapsed: 2.64 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 0.003608, time elapsed: 4.41 seconds
evolution step 600, fitness evaluations: 29950, best fitness: 0.003608, time elapsed: 5.22 seconds
evolution step 700, fitness evaluations: 34950, best fitness: 0.003608, time elapsed: 5.97 seconds
evolution step 800, fitness evaluations: 39950, best fitness: 0.003608, time elapsed: 6.71 seconds
Genetic programming evolution run FINISHED after 840 evolution steps, 41950 fitness evaluations and 7 seconds.

For the first couple seconds, the algorithm wasn’t doing much better than it had last time around, but then something clicked and the best solution got a lot better. Let’s take a look at it.

best_solution <- result$population[[which.min(result$fitnessValues)]]
best_solution
min(result$fitnessValues)
function (x) 
cos(x - 0.291574502599581 - 1.2741260116152)
0.00360807823117699

In other words, this is cos(x – 1.5657), and 1.5657 is very close to pi/2. If we plot the curve, we’ll see an almost-exact match.

plot(y = best_solution(function_interval), x = function_interval, type = "l", lty = 1, xlab = "x", ylab = "y")
lines(y = sin(function_interval), x = function_interval, lty = 2)

Turns out cosine and sine are actually related.

Finding The Best Tools

We can take this one step further and swap in rgp’s built-in trigonometric function set like so:

function_set <- trigonometricFunctionSet

In rgp, this includes the set { sin, cos, tan }. When we run the function with this set, it finds the best solution almost immediately: sin(x).
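
For completeness, here is what that run looks like.  This is a sketch that reuses the variables defined above, with a shorter time limit since convergence is nearly immediate:

function_set <- trigonometricFunctionSet
result <- geneticProgramming(
  functionSet = function_set,
  inputVariables = input_variable_set,
  constantSet = constant_factory_set,
  fitnessFunction = fitness_function,
  stopCondition = makeTimeStopCondition(5) # run for 5 seconds
)
result$population[[which.min(result$fitnessValues)]]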

Tell Me What I Don’t Know

Now we’re going to shake things up a bit. Instead of trying to match a function, we will take a sample of data collected in the wild and attempt to fit a function to it.

if(!require(rgp)) {
    install.packages("rgp", repos = "http://cran.us.r-project.org")
    library(rgp)
}

if(!require(igraph)) {
  install.packages("igraph", repos = "http://cran.us.r-project.org")
  library(igraph)
}

We will still use rgp, but I’m also going to introduce igraph, which allows us to visualize graphs easily.

Unlike before, we don’t have a function in mind; rather, we have some x-y coordinates.

input_data <- data.frame(
    x = c(1, 2, 3, 4, 5, 6, 7),
    y = c(0, 26, 88, 186, 320, 490, 541)
)

If we use plot(input_data$x, input_data$y) to plot this, we get the following image:

Did you know that disco record sales were up 400% for the year ending 1976?  If these trends continue…AY!

Let’s turn this set of numbers into a plausible example. Let’s say we have a rocket car we’re testing, and the y axis represents speed in miles per hour, whereas the x axis represents seconds. What we see is an immediate burst and exponential growth for the first few seconds, but then we hit an inflection point and start tailing off near the end. If we drew this out for a few more seconds, we’d probably see the y value flatline somewhere under 600.

Now that we have our just-so story in place, let’s build our process. We will use the basic arithmetic operators and perform a symbolic regression of y against x, running for 500 steps.

function_set <- functionSet("+", "-", "*")
result <- symbolicRegression(
    y ~ x,
    data = input_data,
    functionSet = function_set,
    stopCondition = makeStepsStopCondition(500)
)

We could run for a specific period of time as well; rgp has several stop conditions available. Anyhow, here’s how the results look in my particular run:

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 33.007936, time elapsed: 3.75 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 33.007936, time elapsed: 7.18 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 33.007936, time elapsed: 10.36 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 33.007936, time elapsed: 13.53 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 33.007936, time elapsed: 16.9 seconds
Genetic programming evolution run FINISHED after 500 evolution steps, 24950 fitness evaluations and 16.9 seconds.

Looks like we hit a pretty decent result early but couldn’t improve on it in the allocated number of operations. Here’s what our result looks like:

plot(input_data$y, col=1, type="l")
points(predict(result, newdata = input_data), col=2, type="l")

No, don’t worry, we can forecast outside of the prediction range no problem!

And here’s the formula that our best result generates:

best_result <- result$population[[which.min(result$fitnessValues)]]
best_result
8.87532839551568 * (-1.28207603423574 * -1.08413319016388) * x * x

Simplifying that, we have approximately 12.336x^2. That’s a pretty good fit for the data set, though it does not capture the falloff near the end and would perform poorly in our out-of-sample data (assuming, again, that subsequent results continue the late trend of trailing off toward a maximum value under 600).
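
As a quick sanity check of that simplification, we can reproduce the constant and compare the fit against our observations:

# The product of the three constants is approximately 12.336
coefficient <- 8.87532839551568 * (-1.28207603423574 * -1.08413319016388)
data.frame(x = input_data$x,
           actual = input_data$y,
           predicted = round(coefficient * input_data$x^2, 1))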

1000 Words Coming Up

I’d like to display the result as a graph, because now that I’m in management, I find myself incapable of understanding anything that does not have a picture associated with it. The rgp package has a nice function called funcToIgraph() built in, which allows you to turn best_result into a graph that igraph can read and display. There’s a bug in rgp, though: when rgp was built, igraph used 0-indexed notation, but they subsequently shifted to 1-based notation. Therefore, rgp has an off-by-one bug. To get around this, I wrote my own function that I can use in this demo:

exprToIgraph <- function(expr) {
  if (!require("igraph")) stop("exprToIgraph: Package 'igraph' not installed.")
  exprGraph <- exprToGraph(expr)
  exprIgraph <- graph(exprGraph$edges, n = length(exprGraph$vertices)) # igraph vertexes used to be counted from zero but no longer are!
  V(exprIgraph)$label <- exprGraph$vertices
  exprIgraph
}

I’d happily submit it as a pull request, but I could only find mirrors and forks of rgp and not any active repository.

Anyhow, let’s satisfy the manager inside and turn this formula into a picture:

final_graph <- exprToIgraph(body(best_result))
plot(final_graph, layout=layout_as_tree, vertex.color="orange", vertex.size=30, edge.arrow.size=0.1)

Somebody out there is mentally picturing this graph in Reverse Polish Notation.

There we go; now I’m rejuvenated and ready to demand some more TPS reports.

Conclusion

Today’s post was a look at a pair of genetic programs which derive mathematical functions. This is one of the main uses of genetic programming, but it’s not the only one. In tomorrow’s post, I will look at another major use case.

Genetic Programming

This is part of a series entitled Genetics In Action.

Up to this point, we have looked at genetic algorithms, one particular evolutionary algorithm.  Now we’re going to take a look at a different evolutionary algorithm called genetic programming.  John Koza popularized genetic programming with his eponymous series of books, starting with Genetic Programming and on through volumes II, III, and IV.

I want to start with points where genetic programming and genetic algorithms overlap, and then we’ll look at the gigantic difference.

More Of The Same

Genetic programs, like genetic algorithms, work from a population of candidate solutions.  Each candidate solution has a chromosome (generally just one), made up of genes.  Genes are the building blocks of the chromosome and the potential values of each gene are its alleles.

In order to determine which candidate solution is the best, we need a fitness function, which we generally implement as a single numeric value.  To get from our starting population to the fittest candidate solution, we take advantage of the crossover and mutation operators and run this process for some number of iterations.

In this regard, genetic programs and genetic algorithms are alike.

But Wait, There’s More!

A big difference between a genetic algorithm and a genetic program  is that genetic programs typically do not have fixed-length chromosomes.  Genetic algorithms do not need to have fixed-length chromosomes either, but this is the norm, at least in Easy Mode.  With genetic programs, even in the easiest setup, we don’t assume fixed numbers of genes.

The other big difference is in the name:  we’re building programs.  In the early genetic programming literature, Koza used Lisp as his language of choice for genetic programs because it has a tree-like syntax that really works well here.  We won’t use Lisp ourselves, but let us take a moment to mourn all of those parentheses who gave their lives in order for us to solve a problem.

Anyhow, the programs can be as simple as mathematical functions or as complex as instruction sets for machinery.  Each program can be displayed as a graph.  I’m going to look at two separate scenarios, focusing mostly on problems in graph format (to make it easy to follow).  First up is mathematical functions, followed by conditional programs.

Mathematical Functions

Let’s say that we want to find the best result to a fitness function.  We have a really, really simple fitness function that always returns (56 – X)^2, and we want to find the global minimum.  We have integer numbers from 0-9 available to us, as well as the mathematical operations +, -, and x.

One potential solution could look like this:

Didn’t even need to count on my fingers for this one.

In this case, we multiply 8 x 7 and get 56.  (56-56)^2 = 0, which is our global minimum.  We’ve solved the problem!
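
If it helps to see this outside of graph form, here is the same candidate written as R code; the function names are mine, purely for illustration:

fitness <- function(x) { (56 - x)^2 }  # our fitness function; lower is better
candidate <- function() { 8 * 7 }      # the evolved program: multiply 8 by 7
fitness(candidate())                   # (56 - 56)^2 = 0, the global minimum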

With a genetic program, it’s usually not going to be that parsimonious.  Instead, genetic programs often will have somewhat more noisy answers:

Parsimony? Sounds like a vegetable. Genetic programs *hate* vegetables.

The answer solves our fitness function all the same, so we should be happy with those results.  But if you want to understand the solution, you’ll often need to reduce the outcome to its simplest form.

Something to notice is that in this mathematical function, the mathematical operators are non-leaf nodes, whereas numeric values are leaf nodes.  If we introduced variables like m and n, those variables could be on leaf nodes as well, but they would not show up on non-leaf nodes.

Conditional Programs

The other major type of genetic program is a conditional program.  The end result here is not a mathematical formula, but rather a decision.

Here’s an old, old sample from my master’s thesis:

It made sense then, honest.

In this sample, we have non-leaf conditionals which lead to leaf node decisions.  This is a simple problem based on a variant of the Prisoner’s Dilemma.  Let’s describe this agent’s behavior.  If the agent defected two turns ago, then check what the opponent agent did two turns ago.  If the opponent defected two turns ago, then exit the game; otherwise, if the opponent cooperated two turns ago, defect.  Finally, if the agent did not defect two turns ago, defect this turn.  This agent’s kind of a jerk.
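
As a sketch, here is that same behavior written out as an R function; the input names and return values are my own illustrative encoding, not part of the original tree:

jerk_agent <- function(own_defected_two_ago, opp_defected_two_ago) {
  if (own_defected_two_ago) {
    if (opp_defected_two_ago) {
      "exit"    # the opponent defected two turns ago, so leave the game
    } else {
      "defect"  # the opponent cooperated two turns ago, so defect
    }
  } else {
    "defect"    # we did not defect two turns ago, so defect this turn
  }
}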

Again, this is an example of a simplified tree.  A more realistic scenario looks a bit like another example from my thesis:

(own_prev x (opp_ever_def (opp_prev x (own_prev2
x d)) (own_prev (own_prev2 (opp_ever_def
(opp_prev x (own_prev2 (own_prev2 x c) d))
(opp_ever_def d (opp_prev2 (good_res x x)
c))) d) (opp_ever_def (opp_ever_def (own_prev
x x) (good_res (opp_prev d c) (own_prev (opp_ever_def
(opp_prev x (own_prev2 (opp_ever_def (good_res
(opp_prev2 (good_res x x) c) x) (opp_prev2
(own_prev2 x c) c)) d)) (opp_ever_def (own_prev
x x) (own_prev2 (good_res (own_prev x x)
(own_prev (opp_ever_def (own_prev2 x c) (good_res
(opp_ever_def (good_res d d) (own_prev2 x
c)) x)) (good_res x x))) c))) (good_res x
(own_prev (own_prev x x) (opp_ever_def (opp_prev
x (own_prev2 (own_prev (good_res (good_res
d d) (opp_ever_def (good_res d d) (own_prev2
x c))) (opp_prev2 (good_res x x) c)) d))
(own_prev2 x c))))))) (own_prev2 x c)))))

Nope, not going to draw a graph for that one…  It simplifies down to the following:

(own_prev x (opp_ever_def (opp_prev x (own_prev2 x d))) (own_prev2 x c))

If you go through the program, you’ll see areas that we can simplify:  contradictory code branches (e.g., if you defected last turn and if you did not defect last turn), redundant results (if you defected last turn, then defect this turn; otherwise, defect this turn), and the like.

Detailing The Operation

So we’ve looked at mathematical functions and conditional programs, but we haven’t quite described the mechanics behind how genetic programs form new candidate solutions over the generations.  It’s easiest to think about this in graph mode, so let’s start with a pair of candidate solutions.

How I met your generational antecedent.

Let’s suppose we want to combine these two programs.  What we would do is find a subtree and perform a lop-and-splice technique, which is totally different from the chop-and-swap of genetic algorithms.  Our job is to cut off a subtree from each of the two parents and splice the new subtrees in.

Don’t worry, I’m a trained professional. I hardly ever lop off the wrong branch.

In this case, we’re swapping the right subtree under x on the left-hand side with the left subtree under + on the right-hand side.  The subtrees do not need to be the same size in order to swap.  The subtrees do not need to be at the same level in order to swap.  And the subtrees do not even need to contain non-leaf nodes.

Once we’re done, it’s time to splice the subtrees, creating two brand new programs.

Just a little bit of duct tape and you’re good as new.

We now have two new programs, which means crossover is complete.  Mutation is similar to the genetic algorithms example.  We can mutate non-leaf nodes as well as leaf nodes, but we need to follow the rules of what’s allowed to be where.

Something in the water keeps causing all of these mutations.

The left-hand program experienced two mutations, whereas the right-hand tree experienced none.  Now we have two new programs and can continue the process.  Similar to genetic algorithms, we can run genetic programs for a certain amount of time, a certain number of generations, a certain number of operations, or until we get the result we’re expecting.

Conclusion

This was a cursory introduction to genetic programming.  Next, I’m going to showcase a couple examples of genetic programming in R.

Plot SQL Server CPU Usage With R

EDIT 2017-06-16: I revised the script a little bit to add in total CPU as well, making it easier to see when you’re at 100% utilization.

This is a quick way of grabbing the last 256 minutes of CPU usage data from SQL Server and plotting it using R.

The SQL Server script comes from Glenn Berry’s outstanding set of DMV queries.  Here’s the R code:

if(!require(RODBC)) {
    install.packages("RODBC")
    library(RODBC)
}

if(!require(ggplot2)) {
    install.packages("ggplot2")
    library(ggplot2)
}

if(!require(tidyverse)) {
    install.packages("tidyverse")
    library(tidyverse)
}

# Fill in your own connection string details here
conn <- odbcDriverConnect("Driver=SQL Server;Server=RDU-SQL-R-01.auctionrover.com;Initial Catalog=DemandForecaster;Provider=SQLNCLI11.1;Integrated Security=SSPI")

# This is Glenn Berry's query to get CPU percentage data for the last 256 minutes.
# https://www.sqlskills.com/blogs/glenn/category/dmv-queries/
cpu_query <- paste("DECLARE @ts_now BIGINT = (SELECT cpu_ticks / (cpu_ticks / ms_ticks) FROM sys.dm_os_sys_info WITH (NOLOCK) );

SELECT TOP (256)
	SQLProcessUtilization AS SQLCPU,
	SystemIdle AS Idle,
	100 - SystemIdle - SQLProcessUtilization AS OtherCPU,
	DATEADD(ms, -1 * (@ts_now - [TIMESTAMP]), GETDATE()) AS EventTime
FROM (
		SELECT
			record.value('(./Record/@id)[1]', 'int') AS record_id,
			record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')
			AS [SystemIdle],
			record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int')
			AS [SQLProcessUtilization],
			[TIMESTAMP]
		FROM (
			SELECT
				[TIMESTAMP],
				CONVERT(XML, record) AS [record]
			FROM sys.dm_os_ring_buffers WITH (NOLOCK)
			WHERE
				ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
				AND record LIKE N'%<SystemHealth>%'
		) AS X
	) AS Y
ORDER BY
	record_id DESC OPTION (RECOMPILE);")

cpu_usage_by_minute <- sqlQuery(conn, cpu_query)

cpu_usage_by_minute$TotalCPU <- cpu_usage_by_minute$SQLCPU + cpu_usage_by_minute$OtherCPU
cpu_usage_by_minute$EventTime <- lubridate::ymd_hms(cpu_usage_by_minute$EventTime)
cpu_usage_by_minute_plot_df <- cpu_usage_by_minute %>%
    gather(key = CPUType, value = PercentUse, SQLCPU, OtherCPU, TotalCPU) %>%
    select(EventTime, CPUType, PercentUse)

ggplot(cpu_usage_by_minute_plot_df, aes(x = EventTime, y = PercentUse, color = CPUType)) +
    geom_line(stat='identity')

close(conn)

At the top of the script, we have a construct to load a package if it already exists; if it doesn’t already exist, install and load it.  The packages we will use today are RODBC (to connect to SQL Server), ggplot2 (to visualize the data), and tidyverse (to structure the data in a nicer format).  Here is an example of the script in action:

On my instance, I have SQL Server R Services installed and am hammering away at the box with it, training a bunch of neural nets concurrently.  We can see that I got started with this sometime around 10:30 AM, stopped it for a few minutes, and then went nuts around 10:45 AM.

Genetic Algorithms In R: The Holyfield Problem

This is part of a series entitled Genetics In Action.

In the last post, we looked at solving a basic genetic algorithms problem:  finding a global maximum for a function consisting of a single real-valued variable.  Today, we are going to look at a more complex interaction and solve one of life’s more interesting challenges:  the knapsack Holyfield problem.

The Holyfield Problem

The Holyfield Problem stems from the classic Sega Genesis video game, Evander Holyfield’s Real Deal Boxing.

Evander Holyfield’s Real Deal Boxing spurred a number of PhD theses back in its day.

The purpose of this game was as complicated as it was challenging:  to make the best boxer there ever was and defeat your father figure in the ring.  There’s a Freudian element in there I’m not particularly comfortable with, but I don’t get to choose the plots of video games.

Anyhow, as mentioned, you want to be the best there is, and the only way to be the best is to train the best.  That’s why the game offers you a number of training opportunities, each of which builds up at least one of the four core skills necessary to punch large men repeatedly in a ring.

An artist’s rendering of the author, shortly after the author slipped the artist a $50.

Above, we see eight of the options available, as well as what the three I’ve selected (punch bag, jump rope, and loose weights) would do for my namesake.  All in all there are 19 training options available, and figuring out the best combination sounds like a job for a genetic algorithm!

Defining The Problem

Let’s take a moment and build out a proper definition of the problem.  As mentioned, we have 19 potential training options available to us.  Each training option has some positive effect on at least one of four stats:  power, stamina, speed, and defense.  Each training option also comes at a cost, and we have a fixed budget.  Our goal is to maximize a fighter fitness function (which I’ll describe below) contingent upon remaining at or below the total allowable cost.

Note that the search space here is the product set of alleles.  We have 19 genes, each of which has 2 alleles, so it’s 2^19.  That means that there are 524,288 potential training regimens.  Many of these will be over cost, but that still leaves a large number to sift through.

Building The Chromosome

Each organism in this example will have a single chromosome with 19 genes, representing each of the 19 training options.  Because we select training options without replacement, the best representation for each gene is a binary set of alleles:  either we have selected the training option or we have not selected it; we cannot select it twice.

Let’s start writing some code:

if(!require(GA)) {
    install.packages("GA", repos = "http://cran.us.r-project.org")
    library(GA)
}

if(!require(ggplot2)) {
    install.packages("ggplot2", repos = "http://cran.us.r-project.org")
    library(ggplot2)
}

training_options <- data.frame(
  item = c("Exercycle", "Head Guard", "Health Club", "Iron Gum Shield", "Jog Machine", "Jump Rope", "Karate",
           "Loose Weights", "Multi Gym", "Power Gloves", "Protein Diet", "Punch Bag", "Running Shoes", "Sparring",
           "Speed Bag", "Speed Boots", "Step-O-Matic", "Track Work", "Vitamins"), 
  stamina = c(2, 0, 1, 0, 2, 1, 0, 0, 1, 0, 1, 2, 2, 0, 1, 0, 2, 1, 1),
  speed =   c(1, 0, 1, 0, 1, 1, 2, 0, 1, 0, 1, 0, 1, 0, 1, 4, 0, 2, 1),
  power =   c(0, 0, 1, 0, 0, 0, 0, 2, 1, 3, 1, 2, 0, 0, 0, 0, 2, 0, 1),
  defense = c(0, 2, 1, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0),
  cost =    c(2, 2, 3, 1, 3, 1, 2, 1, 3, 3, 2, 2, 1, 3, 2, 3, 3, 2, 2))
training_balance <- 8

We have a total training budget of 8 and can see the training options available.  Let’s view this in a table format, as that’s easier to read:

item            stamina speed power defense cost
Exercycle             2     1     0       0    2
Head Guard            0     0     0       2    2
Health Club           1     1     1       1    3
Iron Gum Shield       0     0     0       3    1
Jog Machine           2     1     0       0    3
Jump Rope             1     1     0       0    1
Karate                0     2     0       3    2
Loose Weights         0     0     2       0    1
Multi Gym             1     1     1       0    3
Power Gloves          0     0     3       0    3
Protein Diet          1     1     1       0    2
Punch Bag             2     0     2       0    2
Running Shoes         2     1     0       0    1
Sparring              0     0     0       4    3
Speed Bag             1     1     0       0    2
Speed Boots           0     4     0       0    3
Step-O-Matic          2     0     2       0    3
Track Work            1     2     0       0    2
Vitamins              1     1     1       0    2

Each option has a cost between 1 and 3 points and generates at least 1 point of skill.  Here’s a sample chromosome, where I selected a few options:

sample_chromosome = c(1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
training_options[sample_chromosome == 1, ]

The sample_chromosome is a vector of options, and we feed those sample chromosome values into the training_options data frame to extract only the options I selected:

              item stamina speed power defense cost
1        Exercycle       2     1     0       0    2
4  Iron Gum Shield       0     0     0       3    1
5      Jog Machine       2     1     0       0    3
11    Protein Diet       1     1     1       0    2

This is a valid set of options, in that its cost is within our training budget.  How good is it, though?  Well, in order to figure that out, we need to define a fitness function.

We Pump And We Pump

Our fitness function will explain to us just how good each candidate solution is.

EvaluateBoxerTraining <- function(chromosome) {
  # Each of the training stats follows a log form, so a balanced boxer is better.
  # Then, we'll emphasize speed over power over defense over stamina because reasons.
  eval_speed <- 2.2 * log2(chromosome %*% training_options$speed)
  eval_power <- 1.9 * log2(chromosome %*% training_options$power)
  eval_defense <- 1.6 * log2(chromosome %*% training_options$defense)
  eval_stamina <- 1.3 * log2(chromosome %*% training_options$stamina)
  
  eval_score <- eval_speed + eval_power + eval_defense + eval_stamina
  eval_cost <- chromosome %*% training_options$cost

  # An over-budget training regimen is invalid, so it gets the worst possible score.
  if (eval_cost > training_balance) {
    return(0)
  } else {
    return(eval_score)
  }
}

Let’s walk through this for a moment. Starting with speed, we can see that there’s a function to determine the speed score. That function takes the base-2 logarithm of chromosome %*% training_options$speed. This is matrix multiplication of the chromosome vector times the speed vector in the training_options data frame. Going back to linear algebra, multiplying a 1xN matrix by an Nx1 matrix gives us a 1x1 result. Because our chromosome is made up of 0s and 1s, what we end up doing is adding up all of the speed points that we’ve selected, so that’d be 3 in our sample chromosome above (1x1 + 0x1 + 1x1 + 1x1). As another example, for defense, the calculation is the same: (1x0 + 1x3 + 1x0 + 1x0 = 3). This is a nice way of eliminating an unnecessary for loop in R, and when you’re doing these calculations thousands of times, it can shave off enough time to make this worth knowing.

So we perform matrix multiplication to get the sum of our speed bonuses, which is 3. Then we take the base-2 log of 3 (which is about 1.585) and multiply it by 2.2, giving us a speed score of roughly 3.487.

Our fitness function is the summation of these four scores: the scores for speed, power, defense, and stamina. If we go over the training balance, we return 0 for the fitness score, as it’s an invalid combination; otherwise, we return the fitness score we calculated.
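
We can check the arithmetic from the console using objects we already have in scope; the last line reproduces the fitness score we will see shortly:

sample_chromosome %*% training_options$speed  # a 1x1 matrix containing 3
2.2 * log2(3)                                 # the speed score: about 3.487
2.2 * log2(3) + 1.9 * log2(1) + 1.6 * log2(3) + 1.3 * log2(5)  # 9.041364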

As a quick aside, the reason I take the logarithm of each of these scores is that it gives us a nice property in R: if the speed value is 0, then log2(0) returns -Inf, which we can guarantee will not be the best answer. This requires our boxers to have a minimum value in all four attributes.

Looking at this fitness function, it seems that, ceteris paribus, we want to emphasize speed and power over defense and stamina. Speed and power give us better bonuses, so picking options like speed boots and power gloves would seem to be winners.

Going back to our sample, if I evaluate the sample’s fitness using EvaluateBoxerTraining(sample_chromosome), I get back a score of 9.041364. Is that a good score or a bad score? Who knows? We’ll have to test it against other scores to figure out if we’ve accidentally-not-accidentally picked the best set of training options.

Starting A Boxing School

Now we want to build up our genetic algorithm and put tens of thousands of boxers to the test.  Specifically, I am going to look at a population size of 160 candidate solutions over 100 generations, for a total of up to 16,000 searches—so if we looked at a random search mechanism with the same number of steps, we’d hit just over 3% of the total search space.

Our model is simple, though there are a couple of differences from the first genetic algorithm we saw.

model <- ga(type = "binary", fitness = EvaluateBoxerTraining, popSize = 160,
            run = 50, nBits = 19, pcrossover = 0.8, pmutation = 0.2)
summary(model)

We are using binary genes, so the type is set to binary rather than real. We have defined our fitness function, and then have set a few parameters. Our population size is 160 boxers, and each boxer has a chromosome with 19 bits. Our crossover probability is 80% and our mutation rate is 20%. Finally, we will stop the simulation if the most-fit boxer has not improved in at least 50 generations. This gives us a short-circuit opportunity in case we find the best early on.

Viewing the summary, we end up with a result something like this:

+-----------------------------------+
|         Genetic Algorithm         |
+-----------------------------------+

GA settings: 
Type                  =  binary 
Population size       =  160 
Number of generations =  100 
Elitism               =  8 
Crossover probability =  0.8 
Mutation probability  =  0.2 

GA results: 
Iterations             = 69 
Fitness function value = 15.35445 
Solution = 
     x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19
[1,]  0  0  0  1  0  1  1  1  0   0   0   1   1   0   0   0   0   0   0

We went a total of 69 generations and picked a training set which gives us a fitness score of 15.35445, way better than the 9.041364 we had above. We can look at the solution to see exactly which options the algorithm chose:

solution <- model@solution
solution_options <- training_options[solution == 1, ]
solution_options
              item stamina speed power defense cost
4  Iron Gum Shield       0     0     0       3    1
6        Jump Rope       1     1     0       0    1
7           Karate       0     2     0       3    2
8    Loose Weights       0     0     2       0    1
12       Punch Bag       2     0     2       0    2
13   Running Shoes       2     1     0       0    1

Neither of our big speed/power boosts shows up.  Furthermore, using colSums(solution_options[2:6]) to sum up each column, the results are surprising:

stamina   speed   power defense
      5       4       4       6

Instead of skewing toward speed and power, the best solution actually skews toward defense and stamina!

Now that we’ve found the best solution, let’s look at the evolution of our solution space over time:
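
Just like in the last post, the chart comes from calling the GA package’s plot function on our model:

plot(model)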

“You’re a bum, you’ve always been a bum, and you’ll never be anything but a bum!” The GA coach is a mean coach.

Notice that there are several stair-step increases in the genetic algorithm results.  In the first few generations, the best score was a 0.  Eventually, we jump up to 10, and then a few jaunts up to the 13-14 range before settling on our final score of 15.35445.  It took more than 35 generations for the median to reach our peak fitness, and notice the erratic behavior of the mean:  that’s because we have cases with -Inf results dragging down the mean.  But just like before, we’re not concerned with the specific values of anything except the best-fit result, so that’s fine.

Conclusion

This was a more complicated genetic algorithm, but it shows a classic example of how to solve a problem.  We need to be able to define the problem in terms of genes, alleles, and a fitness function.  If we can do that, then we will be able to feed those into the genetic algorithm.  Many problems are not as easy to fit into a genetic algorithm as our problem was, but if you’re able to define the problem in these terms, you can solve a complicated optimization problem while looking at a tiny fraction of the entire search space.

One nice property that I did not emphasize in the above example is that genetic algorithms don’t work on heuristics like we do.  In my explanation, I gave what I thought would be the best solution heuristic:  look for high-speed and high-power training options and focus on them.  It turns out that this was not in fact a winning strategy.  We don’t really have a way of telling the genetic algorithm to look in a certain part of the search space (not directly, at least), and that’s a good thing.  Although you might think that narrowing the search space would help the algorithm, what it really does is make it more likely that we find a local maximum instead of the global maximum.

Next up on our agenda is genetic programming.

Using Genetic Algorithms In R

This is part of a series entitled Genetics In Action.

In today’s post, we are going to look at the GA package in R, which implements genetic algorithms.  This package allows us to build different types of genetic algorithms in just a few lines of code.  Let’s start with an easy example.

Maximizing A Mathematical Function

Our first example involves finding the global maximum for a mathematical function.  Here’s the setup code:

if(!require(GA)) {
    install.packages("GA", repos = "http://cran.us.r-project.org")
    library(GA)
}

f <- function(x) { (x^2+x)*cos(x) }
min <- -10; max <- 10

The first bit loads the GA library if it does not exist. Then, we generate a fitness function called f. This fitness function tells us how well we’re doing for any value of x. We also have a couple of constraints, specifically that x must be in the range [ -10, 10 ].

Let’s view the curve that this fitness function generates by using the curve function like so: curve(f, min, max, n = 1000).

Child, this is the shape of our domain.  We own everything the eye can see, at least until you get to double digits.

Our goal is to find the value x which maximizes f(x) over the relevant range.  Eyeballing the problem, we can see that the maximum value is somewhere around 6.5, but note that there’s a local maximum somewhere around -6.5.  Many hill-climbing algorithms will get stuck around -6.5 if you start out near that point, so we want to make sure that our genetic algorithm doesn’t get stuck there.
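
As a quick aside, you can watch a hill climber fall into that trap using base R’s optim; this is a sketch for contrast, not part of the genetic algorithm itself:

# optim minimizes, so we negate f; start the climber on the left side of the domain
optim(par = -5, fn = function(x) -f(x), method = "L-BFGS-B",
      lower = min, upper = max)$par
# settles around the local peak near -6.6 rather than the global maximum near 6.56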

To build the genetic algorithm and run its test, call the ga function:

GA <- ga(type = "real-valued", fitness = f, min = min, max = max, monitor = FALSE)
summary(GA)

We selected a type of real-valued, which means that our genes are numeric values with decimals. Our fitness function is the function f, which we defined above, and the constraint is that we must keep our genes between min and max. After the algorithm runs, we get back a summary:

+-----------------------------------+
|         Genetic Algorithm         |
+-----------------------------------+

GA settings: 
Type                  =  real-valued 
Population size       =  50 
Number of generations =  100 
Elitism               =  2 
Crossover probability =  0.8 
Mutation probability  =  0.1 
Search domain = 
     x1
Min -10
Max  10

GA results: 
Iterations             = 100 
Fitness function value = 47.70562 
Solution = 
           x1
[1,] 6.560509

This tells us all about the genetic algorithm. We can see that the defaults are 50 population members, 100 generations, a crossover probability of 80%, and a mutation probability of 10%. We can see that we went through 100 iterations and found that the point 6.56 has a fitness of 47.70562, which is the highest possible score.

If you want to visualize what the algorithm did over time, calling plot(GA) will generate a plot for you:

Plotting the genetic algorithm’s evolution.

There are three things measured here:  the best result, the mean result, and the median result.  In this case, we ended up hitting the best result in generation 1, and that best result stuck around until the end of the simulation.  Normally, we don’t hit the global maximum right from the beginning, but we got lucky this time.  Notice that the median population fitness score eventually converges on the global maximum, but the mean fitness score never really makes it that high.  This is actually a good thing for us:  it shows that at 100 generations, there is still enough genetic diversity that we are building population members which search across the valid space just in case our current best isn’t really the best.  In a non-trivial problem, it could take dozens or hundreds of generations to converge, and if you run out of viable alternatives too early, you’ll never get past a local maximum.  But once the simulation is over, we really only care about the single most fit solution.

Let’s plot our best solution now:

curve(f, min, max, n = 1000)
points(GA@solution, GA@fitnessValue, col = 2, pch = 19)

Our best solution is the global maximum.

We were able to find the global maximum, just as we wanted.

Conclusion

In this post, we were able to use a genetic algorithm to solve a single-variable maximization problem.  It’s not the most exciting problem to solve, but it does give us an introduction into the mechanics of the GA package.  In the next post, we will look at another classic optimization problem.

Building A Genetic Algorithm

This is part of a series entitled Genetics In Action.

So far, we have learned a little bit about evolutionary algorithms and taken a look at just enough high school biology to review the basics of genetics.  Today, I want to look at one particular type of evolutionary algorithm called a genetic algorithm.

Although John Holland did not invent the concept of genetic algorithms, he is the man most responsible for popularizing and developing the concept.  Holland’s Adaptation in Natural and Artificial Systems is a classic of the field and ties together the genetic metaphor.  I highly recommend this book if you’re interested in the topic.

Digging Into The Algorithm

In the last post, we looked at how the genetic metaphor ties development concepts to biological concepts, and today we will move beyond that high-level description and cover the specifics of how genetic algorithms work.  We will start by looking at the simplest type of genetic algorithm:  a single chromosome with a fixed number of genes, each of which has two alleles.  In computer science terms, think about a fixed-length array of boolean values.

Two sample chromosomes

In the image above, we can see two sample chromosomes, each of which has eight genes.  We want to build up a population of these chromosomes at random—one of the interesting parts of the genetic algorithm is that it (usually) doesn’t matter where in the total search space we begin, so starting at a bunch of random points is just as good as anything else.

When it comes to choosing the size of the population, there aren’t too many hard-and-fast rules.  I have read recommendations that you should have at least 2 * N, where N is the number of genes that each chromosome has.  If you’re looking at 10-30 genes, I’ve typically had good luck with a population size of somewhere between 100 and 500.  You’ll find out that there is a maximum interesting population size, after which you don’t really get any benefit:  you won’t converge on a solution any faster, and it will take longer to do all of the processing.

Once we have our population, the next step is to score each organism in the population.  To score an organism, we apply a fitness function.  In computer science terms, this is usually an actual function, where we use the organism’s chromosomal representation as the inputs and generate and return a score for each chromosome.

Scoring each organism

In the image above, we have defined a score for each organism.  This score is typically one of two things:  either it is the distance from an ideal point, or it is a valuation.  In the first case, think of a set of (x, y) coordinates.  We want to define a chromosome that, given an x coordinate, will generate its appropriate y coordinate.  We will calculate some distance between the predicted y and the actual y (for example, we could calculate the Root Mean Square Deviation), where a perfect match has a deviation score of 0.  On the other side, suppose that we can produce N separate products.  Each product has a cost and a price for sale.  Our genetic algorithm might describe the combination of goods we create, and the score would be the net margin (revenue minus cost) of those products.  In that case, a higher number is better for us.
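
As a sketch, those two styles of scoring might look like this in R (the function names are mine):

# Distance-style fitness: root mean square deviation from target values; lower is better
rmsd <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Valuation-style fitness: net margin across the chosen products; higher is better
net_margin <- function(quantities, prices, costs) {
  sum(quantities * (prices - costs))
}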

It’s A Generational Thing

Now that we have the basic idea of a fitness score behind us, let’s go to the next step:  making babies.

We use a complicated matrix with dozens of measures to find the best fit

Now I am showing the entire population, which has four members.  Each member of the population has its own score, and we will use those scores to help us figure out the next generation of organisms.  The mechanism I am showing in the picture above is the simplest selection mechanism for genetic algorithms:  roulette wheel selection.  Basically, add up the fitness values for each member of the population and you get a total score (that’s the 508 above).  Figure out each member’s percentage of the score, and you have a set of values which sums to 1.  Pick a random number between 0 and 1, and wherever you land on the cumulative distribution function, you have your choice of parent.  Note that you draw with replacement, meaning that you can pull the same organism more than once.

To motivate this example, let’s suppose that red owns numbers from 0 up to .335, blue owns numbers from .335 up to .587, green owns .587 up to .815, and yellow owns .815 through 1.0.  Our first random number drawn was .008, so the lucky winner was red.  Then we drew a second parent and pulled .661, which happens to be squarely in green territory.  We now have our two parents.
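
Here is a minimal sketch of that selection step in R.  The share values come from the cumulative ranges above; the raw fitness values live in the picture, so treat these numbers as illustrative:

# Each member's share of the total fitness score
shares <- c(red = 0.335, blue = 0.252, green = 0.228, yellow = 0.185)

pick_parent <- function(shares) {
  draw <- runif(1)  # a random number between 0 and 1
  # Find which segment of the cumulative distribution the draw lands in
  names(shares)[findInterval(draw, cumsum(c(0, shares)))]
}

pick_parent(shares)  # a draw of .008 returns "red"; .661 returns "green"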

Crossover

Now that we have our two parents, we are going to generate two children.  I need to introduce a new concept:  crossover.

When a mommy chromosome loves a daddy chromosome very much…

Crossover is the recombination of a segment of a chromosome.  In the example above, we are switching genes 3-5 from each of the parents for each of the children (though child #2 is shy and is hiding off-camera).

This action is part of the genius behind genetic algorithms.  We’ve already taken some of the fittest organisms (in a large population, we’re going to pull successful organisms much more frequently than unfit organisms, and there are other pulling techniques which bias toward successful chromosomes even more than roulette wheel), and by recombining slices of their genes, we are able to test the waters with new combinations to see if we can find something even more successful.

Of course, there’s no guarantee that the new combination will be more successful than its parents were, so we have a concept known as the crossover percentage.  That is, we only perform crossover a certain percentage of the time.  In practice, this is often anywhere from 60% to 90%.  If we don’t perform crossover, then the two chromosomes just mosey on into the next step of the process.  But if we do roll the dice and land on the ol’ chop-and-swap, then we have two more RNG rounds to play.

The first of these bonus random number pulls determines where we start the chop, and the second pull determines where we stop.  In the picture above, we start at gene 3 and end at gene 5, inclusive.  In genetic algorithms, we typically have fixed-size chromosomes (though that’s not always true!) and therefore symmetrical swaps.
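
In code form, the chop-and-swap looks something like this sketch for fixed-length binary chromosomes; the function and variable names are mine:

crossover <- function(parent1, parent2) {
  # The two bonus RNG pulls: where to start and stop the swap, inclusive
  points <- sort(sample(seq_along(parent1), 2))
  segment <- points[1]:points[2]
  child1 <- parent1
  child2 <- parent2
  child1[segment] <- parent2[segment]  # splice parent 2's segment into child 1
  child2[segment] <- parent1[segment]  # and parent 1's segment into child 2
  list(child1 = child1, child2 = child2)
}

crossover(c(1, 0, 1, 1, 0, 0, 1, 0), c(0, 1, 1, 0, 1, 0, 0, 1))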

Turtle Power

The last step in our genetic factory floor is mutation.  One problem with survival-of-the-fittest is that, especially in later generations, we might run low on genetic diversity.  At an extreme, we end up with a population full of exactly the same chromosome, so no matter how you slice them, you get no novel patterns.  If we’ve reached the global maximum, that’s acceptable, but what if we ended up at only a local maximum and can’t jump over to the global max?  That’s where mutation comes into play.

Not pictured:  martial artist rat or antagonistic, anthropomorphic pig & rhino combo

Mutation works by modifying a particular gene’s value.  For each gene in each new chromosome, mutate with a probability p.  Usually p is somewhere between 0.001% and 1%, though I’ve read papers that argue in certain circumstances, you might need a mutation rate of 20% to 30%.  Those would be cases with very flat local minima/maxima where you can get trapped in that flat spot and need a kick out.
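
In code form, per-gene mutation of a binary chromosome can be as simple as this sketch:

mutate <- function(chromosome, p = 0.01) {
  flips <- runif(length(chromosome)) < p      # each gene mutates with probability p
  chromosome[flips] <- 1 - chromosome[flips]  # flip 0 to 1 and 1 to 0
  chromosome
}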

If you want a fitting metaphor for flat spots, I had an old Toyota Corolla with an old starter.  I’d be able to start the car up successfully several times in a row, but then I’d hit a dead spot in the flywheel and it just wouldn’t start.  Eventually, my dad bought me the Official Starter Repair Kit:  a crowbar.  His advice was to apply sufficient percussive force until I nudged the starter out of its dead spot, and then the car could start successfully.  Mutation provides benefits in much the same way.  And just like my beater bar, mutation is not a technique you want to rely upon constantly.  At the extreme, mutation is just random search, losing all of the important information that a genetic algorithm learns during the process.

Finishing The Process

At this point, we have a finished organism.

All grown up and ready to take on the world

We do this for each slot in the population and then repeat the process:  score, choose, cross over (or not), mutate.  We have a few potential end conditions.  Some of them are:

  1. Stop after a fixed number of generations
  2. Stop after we reach a known ideal solution (e.g., 0 distance from the actual values)
  3. Stop after we get close enough to a known ideal solution
  4. Stop after we have stasis for a certain number of generations, where we have not improved the fitness score for the best agent in a while
  5. Stop after a certain amount of time

There are other stop conditions as well, but these are the most common.

Conclusion

Today we covered the basics of genetic algorithms and how the process works.  Tomorrow, we’ll look at using a genetic algorithm library to solve different types of problems.

The Basics Of Genetics

This is part of a series entitled Genetics In Action.

In today’s post, I want to cover some extremely basic genetic concepts.  We use these concepts in genetic algorithms and genetic programming, and understanding the biological explanations will help us apply the metaphor to software development.

Evolutionary Biology In 400 Words

We have a population of self-contained entities called organisms.  Each organism is entirely independent from other organisms, in the sense that one organism in our population does not depend upon another organism for survival.  In evolutionary algorithms, organisms are our candidate solutions to problems:  we have a number of solutions in our population, and each solution is independent of the other solutions (although they will interact in complex ways, as we’ll talk about later).

Each organism has at least one chromosome.  The purpose of a chromosome is to carry genes.  In evolutionary algorithms, we often simplify the problem by saying that an organism has one chromosome, and it can become easy to conflate organisms and chromosomes for that reason.

Digging deeper, each gene has a number of alternate forms, called alleles.  Combining genes and alleles gives us DNA sequences called genotypes.  A genotype is a possible genetic structure.  Suppose that we have 32 genes, each of which can have 2 alleles.  This would give us 2^32 possible combinations, or a total of 4,294,967,296 possible genotypes.

Genotypes determine phenotypes.  Phenotypes are observable physical characteristics.  Classic examples of phenotypes include eye color, hair color, height, size, and beak structure.  But it’s important to understand that there is no 1:1 correspondence between a genotype and a phenotype; some number of genotypes can end up generating the same phenotype.  Phenotypes depend upon a certain combination of alleles but not the entire genotype.

Also, this goes the other way as well:  a phenotype may only appear when a particular combination of alleles hold a particular value.  Close may count in horseshoes and hand grenades, but it doesn’t count with genetics:  if there are seven alleles which combine to cause a phenotype and only six are set, you will not see the physical characteristics.  Think of this like a password:  unlike in TV and movies, if you have 18 of the 19 characters in a password correct, you shouldn’t get any indicator that you’re ever-so-close.  If we’re using an evolutionary process to find that password (which probably isn’t a good use for an evolutionary algorithm), we won’t get positive feedback until we get it absolutely correct.

The last aspect of genetics that I want to cover today is an environmental niche.  A niche is a set of features which certain phenotypes can exploit.  Think about life in the Arctic:  features like thick fur and layers of fat can help animals survive in these frigid climes.

Applying These To Evolutionary Algorithms

To apply biological concepts to evolutionary algorithms, I’m going to flop around a little bit and start at the end.

First, think about an environmental niche.  The programming concept we’re describing here is a fitness function.  An example of a fitness function may be a set of (x, y) coordinates, and our goal is to find something which maps as closely as possible to those coordinates.

An example of a fitness function

To solve the fitness function, we want to build up a population of organisms.  Our organisms are actually candidate solutions.  In the example above, these would be mathematical functions which fit the fitness function to a greater or lesser extent.

For the sake of simplicity, each organism will have one chromosome.  In its simplest form, a chromosome is an array which contains a set of elements.  Those elements are akin to genes in our biological nomenclature.  Each element has a set of alleles, that is, possible values.  Therefore, the genotype is the particular combination of elements in the chromosomal array.

A trivial example of a chromosome is a binary array:  [ 0, 0, 1, 0, 1, 1, 0, 1 ] would be a sample chromosome.  Here we have 8 genes, each with 2 alleles, for a total of 256 genotypes.  We apply this to the fitness function and get back a score, and that score tells us how fit the organism is.

In the next post, we’ll go deeper into this metaphor, diving into genetic algorithms.  I’ll explain more about fitness functions and explain what makes these evolutionary algorithms rather than biological algorithms.