Bonus Evolutionary Algorithm

This is part of a series entitled Genetics In Action.

To wrap up this series on evolutionary algorithms, I want to provide you with one final algorithm, the Flopping Fish Algorithm. This is not a particularly good algorithm at all, but it is one that I’ve seen in practice.

Let’s start by taking a genetic algorithm. As you’ll recall from prior posts in the series, genetic algorithms have a pair of important modification rates: the crossover rate and the mutation rate. The crossover rate gives a genetic algorithm its power, whereas the mutation rate helps kick us out of local minima/maxima. The crossover rate should be fairly high (60-90% is a reasonable range) whereas mutation should be low (1% or so is a nice start, though in certain extreme cases, it might go as high as 20-30%).

Now let’s go the opposite way from those genetic algorithm rates: a mutation rate above 50% and a crossover rate of 10% or less. As an aside, I’m picking these numbers because they are actual numbers I once found in a production system.

So what does this high-mutation, low-crossover search pattern look like? Flopping fish. Let’s motivate this with a story.

Imagine that you have a two-dimensional board of some arbitrarily large size. Each spot on the board is shaded somewhere between entirely white and entirely red. Your goal is to land on the reddest part of the board.

With a genetic algorithm, you build a fitness function whose output is the amount of red at a given spot on the board. For every (x, y) combination, you get an R value somewhere in the range of [ 0, 255 ]. You start out with a random sample of squares on the board and splice x and y values, and your strongest organisms tend to move in a semi-orderly fashion toward those darker-shaded spots. By the end of the run, you hope to have landed on the darkest spot.

By contrast, the flopping fish algorithm has you start off with a bucket of fish. You pour them onto the board and watch them as they flop around. Their flopping is not entirely aimless, but there is little rhyme or reason to the motions. Eventually, the fish stop flopping and you collect the points at which they met their doom, choosing the darkest-hued spot as your solution.

The flopping fish algorithm is a waystation between a genetic algorithm (which you might consider controlled “randomness”) and random selection of points. It is a little different from pulling a bunch of random numbers from thin air—with the flopping fish, each fish can move only so far in its gasps—but the effect is about the same. A genetic algorithm with too large a mutation rate effectively ignores the signals that the fitness function provides—and remember, those signals are the *only* signals the algorithm gets in order to tell if it’s on the right track.

If you’re going to run a genetic algorithm with an extremely high mutation rate and an extremely low crossover rate, save yourself the trouble and just pluck random numbers from thin air; you’ll get comparable results with far less machinery.
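To make that claim concrete, here is a base-R sketch contrasting a flopping-fish search against plain random sampling. Everything in it is made up for illustration: the "redness" function (peaking at (70, 30)), the board size, the hop distance, and the population size.

```r
set.seed(42)

# Hypothetical redness function: values get larger (redder) near (70, 30).
redness <- function(x, y) 255 - (abs(x - 70) + abs(y - 30))

# One "fish": start at a random spot on a 100x100 board, take 50 small
# hops that ignore the fitness signal entirely, and report the redness
# of wherever it ends up.
flop <- function(steps = 50) {
  pos <- runif(2, min = 0, max = 100)
  for (i in seq_len(steps)) {
    pos <- pmin(pmax(pos + runif(2, -5, 5), 0), 100)
  }
  redness(pos[1], pos[2])
}

# A bucket of 100 fish versus 100 points pulled from thin air.
fish  <- replicate(100, flop())
darts <- redness(runif(100, 0, 100), runif(100, 0, 100))
c(best_fish = max(fish), best_dart = max(darts))
```

In repeated runs, the two approaches land in the same neighborhood, which is exactly the point: without crossover exploiting the fitness signal, the "algorithm" is an expensive random draw.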

Genetic Programming: DBA Salaries

This is part of a series entitled Genetics In Action.

Today’s post shows off another variant of genetic programming.  Last time around, we looked at mathematical functions using rgp.  Today, we are going to build conditional programs using evtree, another evolutionary library in R.

Time To Buy Me A DBA

The data set we will use is the survey of data professional salaries that Brent Ozar put together in 2016.  I’ve already looked at the data once, but now I want to apply a different approach:  I want to see how well I can explain and predict variance in DBA salaries.

Starting out, I’m going to load three packages:  evtree (which lets me build evolutionary trees), XLConnect (which lets me read Excel files), and tidyverse (which is the Swiss army knife of R packages):

if(!require(evtree)) {
    install.packages("evtree", repos = "http://cran.us.r-project.org")
    library(evtree)
}

if(!require(XLConnect)) {
  install.packages("XLConnect", repos = "http://cran.us.r-project.org")
  library(XLConnect)
}

if(!require(tidyverse)) {
  install.packages("tidyverse", repos = "http://cran.us.r-project.org")
  library(tidyverse)
}

Let’s pull in the salary data using XLConnect.

wb <- loadWorkbook("Data/2017_Data_Professional_Salary_Survey_Responses.xlsx")
salary_data <- readWorksheet(wb, sheet = "Salary Survey", region = "A4:T2902")

Next up, I would like to look at the structure of the salary data, which I can do easily by running str(salary_data).

We have a sample of approximately 3000 data professionals, as well as 18 variables per observation—I’m ignoring Timestamp and Counter, as they are not relevant to the discussion.

What To Do, What To Include

Last time, I looked at purchasing power parity to try to find countries where data professionals are relatively better off. This time around, I’m going to limit my search to data professionals in the United States. Even within the United States, there are differences in purchasing power (a DBA making $100K a year in San Francisco or New York City is in quite a different position than a DBA making $100K in Peoria or Richmond), but the spread is smaller. It’s a spread I cannot control for, so I expect to leave a healthy chunk of the variance unexplained, but we proceed apace.

Next, I will focus only on those who selected Microsoft SQL Server as their primary database. I don’t want to deal with salary-level differences between Oracle, DB2, MySQL, and SQL Server professionals. Also, a large majority of respondents selected SQL Server, so there’s probably some selection bias among the non-SQL Server participants.

After that, I want to include only full-time employees, filtering out part-time, freelance, or contract employees. That’s not a value judgment, but it does constrain my sample a bit further.

Finally, I want to filter out anybody making more than $300K a year. There are only a couple of those in the data set, and they skew the results pretty hard. If you’re making $300K+ a year as a full-time employee, great; but you’re so atypical a case that I don’t care about you here. Alternatively, at least one of those results might be a typo, but we can’t be sure.

Now that I’ve explained the filters, I want to discuss my interesting variables. I want to explain SalaryUSD using five variables: YearsWithThisDatabase, HoursWorkedPerWeek, Education, EmploymentSector, and ManageStaff. The other variables are probably helpful as well, but for my demo, I want to keep the number of variables in play relatively low.

Now that I’ve discussed what I plan to do, let’s actually do it, in far fewer lines of R than it took to explain:

salary_pred <- salary_data %>%
  filter(Country == "United States") %>%
  filter(PrimaryDatabase == "Microsoft SQL Server") %>%
  filter(EmploymentStatus == "Full time employee") %>%
  filter(SalaryUSD <= 300000) %>%
  select(SalaryUSD, YearsWithThisDatabase, HoursWorkedPerWeek, Education, EmploymentSector, ManageStaff)

salary_pred$Education <- as.factor(salary_pred$Education)
salary_pred$EmploymentSector <- as.factor(salary_pred$EmploymentSector)
salary_pred$ManageStaff <- as.factor(salary_pred$ManageStaff)

I now have a data set called salary_pred which includes the interesting variables. After putting in all of my filters, I still have 1624 observations, so I’m satisfied that I should have enough data to find something interesting.

Something Interesting

I am now going to build a regression tree with evtree. A regression tree is a fairly simple idea: it’s a decision tree, except that each leaf node predicts a numeric value (here, the mean salary of the observations that fall into that leaf) rather than a class. The syntax for building a genetic program which gives us a regression tree is pretty simple:

ev <- evtree(SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + Education + EmploymentSector + ManageStaff,
             data = salary_pred, minbucket = 10, maxdepth = 4)

This churns for a little bit and then, when it’s done, we can call plot(ev) to get the following image:

This image is a little busy; I actually find it easier to read the results by running ev instead:

Model formula:
SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + Education + 
    EmploymentSector + ManageStaff

Fitted party:
[1] root
|   [2] YearsWithThisDatabase < 8
|   |   [3] YearsWithThisDatabase < 5: 73507.922 (n = 244, err = 117381058953.5)
|   |   [4] YearsWithThisDatabase >= 5: 90571.821 (n = 274, err = 126256913002.2)
|   [5] YearsWithThisDatabase >= 8
|   |   [6] EmploymentSector in Local government, State/province government, Student: 91592.034 (n = 117, err = 53016079035.9)
|   |   [7] EmploymentSector in Federal government, Private business
|   |   |   [8] ManageStaff in No
|   |   |   |   [9] YearsWithThisDatabase < 16: 105536.492 (n = 453, err = 253588315377.2)
|   |   |   |   [10] YearsWithThisDatabase >= 16: 117556.179 (n = 285, err = 172506498805.9)
|   |   |   [11] ManageStaff in Yes: 123739.729 (n = 251, err = 200634714045.6)

Number of inner nodes:    5
Number of terminal nodes: 6

Let’s sum this up in a list, where each selection is exclusive of the others:

  1. If you have spent less than 5 years as a SQL Server professional, we expect your salary to be around $73,507.
  2. If you have spent between 5 and 8 years as a SQL Server professional, we expect your salary to be around $90,571.
  3. If you have spent at least eight years as a SQL Server professional and you are in local/state government or you are a student, we expect your salary to be around $91,592.
  4. If you work in federal government or private business, are not a manager, and have spent 8-16 years as a SQL Server professional, we expect your salary to be around $105,536.
  5. If you work in federal government or private business, are not a manager, and have spent at least 16 years as a SQL Server professional, we expect your salary to be around $117,556.
  6. If you work in federal government or private business, are a manager, and have spent at least 8 years as a SQL Server professional, we expect your salary to be around $123,739.

On the plus side, each of these nodes has at least 100 members, and most have 250+, so it’s a fairly broad distribution in each. On the minus side, the error terms seem rather large.
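For reference, the six rules can be hand-transcribed into a plain R function. This is my own transcription of the fitted tree, not evtree output:

```r
# The fitted tree's six rules as nested conditionals. Leaf values are the
# fitted node means from the evtree output above.
tree_predict <- function(years, sector, manages) {
  if (years < 5) return(73507.922)
  if (years < 8) return(90571.821)
  if (sector %in% c("Local government", "State/province government", "Student")) {
    return(91592.034)
  }
  if (manages == "No") {
    if (years < 16) return(105536.492) else return(117556.179)
  }
  123739.729
}

tree_predict(3, "Private business", "No")    # 73507.922
tree_predict(10, "Private business", "Yes")  # 123739.729
```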

Thinking Linearly

To give a contrast to our regression tree, let’s try a simple linear regression.

salary_lm <- lm(SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + Education + EmploymentSector + ManageStaff,
                data = salary_pred)

There’s nothing fancy to this regression, and when I call summary(salary_lm), I get the following:

Call:
lm(formula = SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + 
    Education + EmploymentSector + ManageStaff, data = salary_pred)

Residuals:
    Min      1Q  Median      3Q     Max 
-110294  -15585   -1307   12911  144645 

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                                67621.12    6151.45  10.993  < 2e-16
YearsWithThisDatabase                       2262.35     100.98  22.403  < 2e-16
HoursWorkedPerWeek                           155.67      93.77   1.660 0.097090
EducationBachelors (4 years)                7288.25    1958.67   3.721 0.000205
EducationDoctorate/PhD                     17428.29    8201.28   2.125 0.033733
EducationMasters                           12071.38    2279.88   5.295 1.36e-07
EducationNone (no degree completed)         4931.55    2375.89   2.076 0.038083
EmploymentSectorLocal government          -21701.26    5074.87  -4.276 2.01e-05
EmploymentSectorPrivate business           -5810.89    4357.09  -1.334 0.182502
EmploymentSectorState/province government -23477.37    4985.01  -4.710 2.70e-06
EmploymentSectorStudent                   -29033.34   10683.44  -2.718 0.006646
ManageStaffYes                             11518.96    1452.14   7.932 3.99e-15
                                             
(Intercept)                               ***
YearsWithThisDatabase                     ***
HoursWorkedPerWeek                        .  
EducationBachelors (4 years)              ***
EducationDoctorate/PhD                    *  
EducationMasters                          ***
EducationNone (no degree completed)       *  
EmploymentSectorLocal government          ***
EmploymentSectorPrivate business             
EmploymentSectorState/province government ***
EmploymentSectorStudent                   ** 
ManageStaffYes                            ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23930 on 1612 degrees of freedom
Multiple R-squared:  0.3228,	Adjusted R-squared:  0.3182 
F-statistic: 69.86 on 11 and 1612 DF,  p-value: < 2.2e-16

What we see is that the linear model explains approximately 32% of the variance in salaries. As I mentioned above, I believe that relative purchasing power plays a significant role in explaining variance, but because we don’t have the data at that grain, I cannot test this conjecture.

That aside, we can see some interesting contrasts. Note that the linear model picked up local government, state government, and student status as major negatives, knocking off at least $21K apiece and marking each as statistically significant. The linear model, meanwhile, does not consider hours worked per week significant, either statistically (p = 0.097) or in terms of magnitude: each extra hour of work per week is associated with only about $155 more in salary per year. I’d hazard a guess that this is because most full-time data professionals are salaried employees.

Let’s Compare

Now that we have a genetic model and a linear model, we want to compare the two and see which turns out better.

First, let’s use our models to generate predictions over the data set. Then I want to calculate the root mean square deviation (RMSD) for each of the two.

salary_pred$EVPrediction <- predict(ev)
salary_pred$LMPrediction <- predict(salary_lm)

sqrt(mean((salary_pred$SalaryUSD - salary_pred$EVPrediction)^2))
sqrt(mean((salary_pred$SalaryUSD - salary_pred$LMPrediction)^2))

The evolutionary regression tree has an RMSD of $23,845 whereas the linear regression has an RMSD of $23,841. In other words, they’re practically the same.

But let’s not allow aggregates to conceal the important mechanisms of each model. Let’s see how each predicts for different people.

The Lives Of Others (But Mostly Me)

In this test, I want to build up three test cases. The first test is a new graduate, right out of school with a 4-year Bachelor’s. The second test is me, after having spent 5 years working for the state. The final test is me today.

test_cases <- data.frame(
    YearsWithThisDatabase = c(0, 5, 11),
    HoursWorkedPerWeek = c(40, 40, 40),
    Education = c("Bachelors (4 years)", "Masters", "Masters"),
    EmploymentSector = c("Private business", "State/province government", "Private business"),
    ManageStaff = c("No", "No", "Yes")
)
  YearsWithThisDatabase HoursWorkedPerWeek           Education          EmploymentSector ManageStaff
1                     0                 40 Bachelors (4 years)          Private business          No
2                     5                 40             Masters State/province government          No
3                    11                 40             Masters          Private business         Yes

We can see how each of the models would predict the people in this sample.

predict(ev, test_cases)
predict(salary_lm, test_cases)

In the first case, for the new person, both models are in harmony: the evolutionary regression tree came up with a value of $73,507 and the linear regression was $75,325.
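As a sanity check, that linear prediction can be reconstructed by hand from the coefficient table above: intercept, zero years, 40 hours, the Bachelors bump, and the private-business adjustment.

```r
# New graduate: 0 years, 40 hours/week, Bachelors (4 years),
# private business, does not manage staff.
67621.12 + 0 * 2262.35 + 40 * 155.67 + 7288.25 - 5810.89
# 75325.28
```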

In the second case, the evolutionary regression tree came up with $90,571 and the linear regression was $73,753. That’s a huge difference between the two!

In the third case, we see another fairly big difference: $123,739 for the evolutionary model versus $116,513 for the linear model.

Even though the two models have the same top-line deviance from the mean, it’s clear which model I prefer!

Conclusion

This wraps up genetic programming.  There’s a lot more to the topic (after all, John Koza wrote numerous books on it!), but hopefully this gives you a sense of where a genetic programming algorithm might be useful.

Genetic Programming: Mathematical Functions

This is part of a series entitled Genetics In Action.

In today’s post, I am going to look at two examples of deriving a mathematical function.  I will use the rgp package in R.  It appears that nobody is maintaining the rgp package anymore, so this isn’t something I’d use in a production scenario, but it does offer us a fairly easy introduction.

Tell Me What I Know

In this first example, I will look at a case where we hope to see the genetic program approximate a known function as closely as possible. You can think of this as a supervised test, where we have a desired end state and want to see how well the genetic program’s results fare in comparison.

A Craftsman Never Blames His Tools

if(!require(rgp)) {
    install.packages("rgp", repos = "http://cran.us.r-project.org")
    library(rgp)
}

function_set <- functionSet("+", "*", "-")
input_variable_set <- inputVariableSet("x")
constant_factory_set <- constantFactorySet(function() { rnorm(1) })

In this first bit of code, we load the rgp package and set up some of our constraints. I want to derive a mathematical formula using only the addition, multiplication, and subtraction operators. As a note, division can be troublesome in a genetic program because of the chance of dividing by zero; implementations that do include it typically use a "protected" division operator which returns a safe value when the divisor is zero.

Aside from mathematical operators, we have one independent variable in the mix: x. Now we need to figure out our fitness function.

function_interval <- seq(from = -pi, to = pi, by = 0.1)
fitness_function <- function(f) { rmse(f(function_interval), sin(function_interval)) }

Our end goal is a sine wave that runs from -pi to pi. Our fitness function is the Root Mean Square Error of our generated function versus the ideal of the sine wave, meaning that we want to minimize the result of our fitness function, with a value of 0 matching the sine wave exactly. Now that we have everything in place, let’s kick off the process.

result <- geneticProgramming(
  functionSet = function_set,
  inputVariables = input_variable_set,
  constantSet = constant_factory_set,
  fitnessFunction = fitness_function,
  stopCondition = makeTimeStopCondition(20) #run for 20 seconds
)

This will run for 20 seconds, during which time you will hopefully find functions which converge closer and closer to the ideal. In my case, it looked like this:

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 0.706164, time elapsed: 0.84 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 0.445428, time elapsed: 1.4 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 0.445256, time elapsed: 2.42 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 0.445256, time elapsed: 3.53 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 0.445256, time elapsed: 4.32 seconds
evolution step 600, fitness evaluations: 29950, best fitness: 0.445256, time elapsed: 5.26 seconds
evolution step 700, fitness evaluations: 34950, best fitness: 0.445256, time elapsed: 6.15 seconds
evolution step 800, fitness evaluations: 39950, best fitness: 0.445256, time elapsed: 6.75 seconds
evolution step 900, fitness evaluations: 44950, best fitness: 0.445210, time elapsed: 7.59 seconds
evolution step 1000, fitness evaluations: 49950, best fitness: 0.445210, time elapsed: 8.29 seconds
evolution step 1100, fitness evaluations: 54950, best fitness: 0.445210, time elapsed: 8.9 seconds
evolution step 1200, fitness evaluations: 59950, best fitness: 0.445210, time elapsed: 9.81 seconds
evolution step 1300, fitness evaluations: 64950, best fitness: 0.445210, time elapsed: 12.35 seconds
evolution step 1400, fitness evaluations: 69950, best fitness: 0.445210, time elapsed: 13.09 seconds
evolution step 1500, fitness evaluations: 74950, best fitness: 0.445210, time elapsed: 14.81 seconds
evolution step 1600, fitness evaluations: 79950, best fitness: 0.445198, time elapsed: 15.48 seconds
evolution step 1700, fitness evaluations: 84950, best fitness: 0.445198, time elapsed: 16.17 seconds
evolution step 1800, fitness evaluations: 89950, best fitness: 0.445198, time elapsed: 17.29 seconds
evolution step 1900, fitness evaluations: 94950, best fitness: 0.445198, time elapsed: 18.01 seconds
evolution step 2000, fitness evaluations: 99950, best fitness: 0.445198, time elapsed: 18.68 seconds
evolution step 2100, fitness evaluations: 104950, best fitness: 0.445198, time elapsed: 19.31 seconds
Genetic programming evolution run FINISHED after 2138 evolution steps, 106850 fitness evaluations and 20 seconds.

There was some convergence, but it wasn’t the best. Let’s take a look at the best program:

best_solution <- result$population[[which.min(result$fitnessValues)]]
best_solution
min(result$fitnessValues)

That spits back the following:

function (x) 
0.305371214275836 * (x - -0.0445197474230304)
0.445198186160344

The reduced form function is 0.305x + 0.013595. That’s a linear function with respect to x, and the RMSE is 0.445, which is not great.
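To double-check that, we can evaluate the evolved function against the sine wave over the same interval the fitness function used:

```r
# Recompute the RMSE of the evolved function versus sin(x).
function_interval <- seq(from = -pi, to = pi, by = 0.1)
evolved <- function(x) 0.305371214275836 * (x - -0.0445197474230304)
sqrt(mean((evolved(function_interval) - sin(function_interval))^2))
# ~0.445198, matching the reported best fitness
```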

Here’s a plot of our solution versus the ideal solution:

plot(y = best_solution(function_interval), x = function_interval, type = "l", lty = 1, xlab = "x", ylab = "y")
lines(y = sin(function_interval), x = function_interval, lty = 2)

Genetic program with a one-track mind.

A Craftsman Goes Out And Gets Better Tools

That first result wasn’t very good, and part of the problem is that we did not give the genetic program the right tools to figure out a solution to our problem. So let’s change the function set just a little bit:

function_set <- functionSet("+", "*", "-", "cos")

You and I know that cosine and sine have a direct relationship: shift sine by pi/2 and you get cosine. Let’s see if the genetic program figures out the same. This time around, I ran the function for 7 seconds, and we can see the improvement:

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 0.458847, time elapsed: 0.64 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 0.445672, time elapsed: 1.32 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 0.124575, time elapsed: 2 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 0.003608, time elapsed: 2.64 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 0.003608, time elapsed: 4.41 seconds
evolution step 600, fitness evaluations: 29950, best fitness: 0.003608, time elapsed: 5.22 seconds
evolution step 700, fitness evaluations: 34950, best fitness: 0.003608, time elapsed: 5.97 seconds
evolution step 800, fitness evaluations: 39950, best fitness: 0.003608, time elapsed: 6.71 seconds
Genetic programming evolution run FINISHED after 840 evolution steps, 41950 fitness evaluations and 7 seconds.

For the first couple seconds, the algorithm wasn’t doing much better than it had last time around, but then something clicked and the best solution got a lot better. Let’s take a look at it.

best_solution <- result$population[[which.min(result$fitnessValues)]]
best_solution
min(result$fitnessValues)
function (x) 
cos(x - 0.291574502599581 - 1.2741260116152)
0.00360807823117699

In other words, this is cos(x - 1.5657), and that phase shift is very close to pi/2. If we plot the curve, we’ll see an almost-exact match.

plot(y = best_solution(function_interval), x = function_interval, type = "l", lty = 1, xlab = "x", ylab = "y")
lines(y = sin(function_interval), x = function_interval, lty = 2)

Turns out cosine and sine are actually related.
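We can also confirm the near-match numerically by computing the RMSE of the evolved function against sine over the fitness interval:

```r
# Recompute the RMSE of the evolved cosine solution versus sin(x).
function_interval <- seq(from = -pi, to = pi, by = 0.1)
evolved <- function(x) cos(x - 0.291574502599581 - 1.2741260116152)
sqrt(mean((evolved(function_interval) - sin(function_interval))^2))
# ~0.0036, the reported best fitness
```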

Finding The Best Tools

We can take this one step further and have the function set include all basic math + trigonometric functions like so:

function_set <- trigonometricFunctionSet

In rgp, this includes the set { sin, cos, tan }. When we run the function with this set, it finds the best solution almost immediately: sin(x).

Tell Me What I Don’t Know

Now we’re going to shake things up a bit. Instead of trying to match a function, we will take a sample of data collected in the wild and attempt to fit a function to it.

if(!require(rgp)) {
    install.packages("rgp", repos = "http://cran.us.r-project.org")
    library(rgp)
}

if(!require(igraph)) {
  install.packages("igraph", repos = "http://cran.us.r-project.org")
  library(igraph)
}

We will still use rgp, but I’m also going to introduce igraph, which allows us to visualize graphs easily.

Unlike before, we don’t have a function in mind; rather, we have some x-y coordinates.

input_data <- data.frame(
    x = c(1, 2, 3, 4, 5, 6, 7),
    y = c(0, 26, 88, 186, 320, 490, 541)
)

If we use plot(x,y) to plot this, we get the following image:

Did you know that disco record sales were up 400% for the year ending 1976?  If these trends continue…AY!

Let’s turn this set of numbers into a plausible example. Let’s say we have a rocket car we’re testing, and the y axis represents speed in miles per hour, whereas the x axis represents seconds. What we see is an immediate burst and rapidly accelerating growth for the first few seconds, but then we hit an inflection point and start tailing off near the end. If we drew this out for a few more seconds, we’d probably see the y value flatline somewhere under 600.

Now that we have our just-so story in place, let’s build our process. We will use the basic arithmetic operators and perform a symbolic regression of y against x, running for 500 steps.

function_set <- functionSet("+", "-", "*")
result <- symbolicRegression(
    y ~ x,
    data = input_data,
    functionSet = function_set,
    stopCondition = makeStepsStopCondition(500)
)

We could run for a specific period of time as well; rgp has several stop conditions available. Anyhow, here’s how the results look in my particular run:

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 33.007936, time elapsed: 3.75 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 33.007936, time elapsed: 7.18 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 33.007936, time elapsed: 10.36 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 33.007936, time elapsed: 13.53 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 33.007936, time elapsed: 16.9 seconds
Genetic programming evolution run FINISHED after 500 evolution steps, 24950 fitness evaluations and 16.9 seconds.

Looks like we hit a pretty decent result early but couldn’t improve on it in the allocated number of operations. Here’s what our result looks like:

plot(input_data$y, col=1, type="l")
points(predict(result, newdata = input_data), col=2, type="l")

No, don’t worry, we can forecast outside of the prediction range no problem!

And here’s the formula that our best result generates:

best_result <- result$population[[which.min(result$fitnessValues)]]
best_result
8.87532839551568 * (-1.28207603423574 * -1.08413319016388) * x * x

Simplifying that, we have approximately 12.336x^2. That’s a pretty good fit for the data set, though it does not capture the falloff near the end and would perform poorly in our out-of-sample data (assuming, again, that subsequent results continue the late trend of trailing off toward a maximum value under 600).
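Before moving on, we can verify both the simplification and the reported fitness with a couple of lines of base R. This assumes, as the run output suggests, that the fitness is the RMSE over the seven data points:

```r
# Checking the simplified coefficient and the fitness value by hand.
input_data <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7),
  y = c(0, 26, 88, 186, 320, 490, 541)
)
k <- 8.87532839551568 * (-1.28207603423574 * -1.08413319016388)
k                                                  # ~12.336
sqrt(mean((k * input_data$x^2 - input_data$y)^2))  # ~33.008
```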

1000 Words Coming Up

I’d like to display the result as a graph, because now that I’m in management, I find myself incapable of understanding anything that does not have a picture associated with it. The rgp package has a nice function called funcToIgraph() built in, which allows you to turn best_result into a graph that igraph can read and display. There’s a bug in rgp, though: when rgp was built, igraph counted vertices from 0, but igraph subsequently switched to 1-based indexing, leaving rgp with an off-by-one bug. To get around this, I wrote my own function for this demo:

exprToIgraph <- function(expr) {
  if (!require("igraph")) stop("exprToIgraph: Package 'igraph' not installed.")
  exprGraph <- exprToGraph(expr)
  exprIgraph <- graph(exprGraph$edges, n = length(exprGraph$vertices)) # igraph vertexes used to be counted from zero but no longer are!
  V(exprIgraph)$label <- exprGraph$vertices
  exprIgraph
}

I’d happily submit it as a pull request, but I could only find mirrors and forks of rgp and not any active repository.

Anyhow, let’s satisfy the manager inside and turn this formula into a picture:

final_graph <- exprToIgraph(body(best_result))
plot(final_graph, layout=layout_as_tree, vertex.color="orange", vertex.size=30, edge.arrow.size=0.1)

Somebody out there is mentally picturing this graph in Reverse Polish Notation.

There we go; now I’m rejuvenated and ready to demand some more TPS reports.

Conclusion

Today’s post was a look at a pair of genetic programs which derive mathematical functions. This is one of the main uses of genetic programming, but it’s not the only one. In tomorrow’s post, I will look at another major use case.

Genetic Programming

This is part of a series entitled Genetics In Action.

Up to this point, we have looked at genetic algorithms, one particular evolutionary algorithm.  Now we’re going to take a look at a different evolutionary algorithm called genetic programming.  John Koza popularized genetic programming with his eponymous series of books, starting with Genetic Programming and on through volumes II, III, and IV.

I want to start with points where genetic programming and genetic algorithms overlap, and then we’ll look at the gigantic difference.

More Of The Same

Genetic programs, like genetic algorithms, work from a population of candidate solutions.  Each candidate solution has a chromosome (generally just one), made up of genes.  Genes are the building blocks of the chromosome and the potential values of each gene are its alleles.

In order to determine which candidate solution is the best, we need a fitness function, which we generally implement as a single numeric value.  To get from our starting population to the fittest candidate solution, we take advantage of the crossover and mutation operators and run this process for some number of iterations.

In this regard, genetic programs and genetic algorithms are alike.

But Wait, There’s More!

A big difference between a genetic algorithm and a genetic program is that genetic programs typically do not have fixed-length chromosomes.  Genetic algorithms do not need fixed-length chromosomes either, but that is the norm, at least in Easy Mode.  With genetic programs, even in the easiest setup, we don’t assume a fixed number of genes.

The other big difference is in the name:  we’re building programs.  In the early genetic programming literature, Koza used Lisp as his language of choice for genetic programs because it has a tree-like syntax that really works well here.  We won’t use Lisp ourselves, but let us take a moment to mourn all of those parentheses who gave their lives in order for us to solve a problem.

Anyhow, the programs can be as simple as mathematical functions or as complex as instruction sets for machinery.  Each program can be displayed as a graph.  I’m going to look at two separate scenarios, focusing mostly on problems in graph format (to make it easy to follow).  First up is mathematical functions, followed by conditional programs.

Mathematical Functions

Let’s say that we want to find the best result to a fitness function.  We have a really, really simple fitness function that always returns (56 – X)^2, and we want to find the global minimum.  We have integer numbers from 0-9 available to us, as well as the mathematical operations +, -, and x.

One potential solution could look like this:

Didn’t even need to count on my fingers for this one.

In this case, we multiply 8 x 7 and get 56.  (56-56)^2 = 0, which is our global minimum.  We’ve solved the problem!

With a genetic program, it’s usually not going to be that parsimonious.  Instead, genetic programs often will have somewhat more noisy answers:

Parsimony? Sounds like a vegetable. Genetic programs *hate* vegetables.

The answer solves our fitness function all the same, so we should be happy with those results.  But if you want to understand the solution, you’ll often need to reduce the outcome to its simplest form.

Something to notice is that in this mathematical function, the mathematical operators are non-leaf nodes, whereas numeric values are leaf nodes.  If we introduced variables like m and n, those variables could be on leaf nodes as well, but they would not show up on non-leaf nodes.
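To make that structure concrete, here is a minimal sketch in R.  The nested-list representation and function names are my own illustration, not something from a genetic programming library: operators live on non-leaf nodes, and numeric constants live on leaf nodes.

```r
# Hypothetical tree representation: operators on non-leaf nodes,
# numeric constants on leaf nodes.
evaluate_tree <- function(node) {
  if (!is.list(node)) {
    return(node)  # leaf node: a numeric constant
  }
  left <- evaluate_tree(node$left)
  right <- evaluate_tree(node$right)
  switch(node$op,
         "+" = left + right,
         "-" = left - right,
         "*" = left * right)
}

# Our fitness function: (56 - X)^2, which we want to minimize.
fitness <- function(tree) {
  (56 - evaluate_tree(tree))^2
}

# The parsimonious solution from above: 8 x 7
candidate <- list(op = "*", left = 8, right = 7)
fitness(candidate)  # 0, the global minimum
```

If we later introduce variables like m and n, they would slot in exactly where the numeric constants do: on leaf nodes only.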

Conditional Programs

The other major type of genetic program is a conditional program.  The end result here is not a mathematical formula, but rather a decision.

Here’s an old, old sample from my master’s thesis:

It made sense then, honest.

In this sample, we have non-leaf conditionals which lead to leaf node decisions.  This is a simple problem based on a variant of the Prisoner’s Dilemma.  Let’s describe this agent’s behavior.  If the agent defected two turns ago, then check what the opponent agent did two turns ago.  If the opponent defected two turns ago, then exit the game; otherwise, if the opponent cooperated two turns ago, defect.  Finally, if the agent did not defect two turns ago, defect this turn.  This agent’s kind of a jerk.
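As a sketch, the simplified agent above could be written as an ordinary R function.  This encoding is my own for illustration, not the representation from the thesis:

```r
# Decide this turn's move from two facts about the game two turns back.
# "exit" = leave the game; "defect" = defect this turn.
agent_move <- function(own_defected_two_ago, opp_defected_two_ago) {
  if (own_defected_two_ago) {
    if (opp_defected_two_ago) {
      "exit"    # opponent also defected two turns ago: walk away
    } else {
      "defect"  # opponent cooperated two turns ago: defect
    }
  } else {
    "defect"    # we did not defect two turns ago: defect anyway
  }
}

agent_move(TRUE, TRUE)    # "exit"
agent_move(FALSE, FALSE)  # "defect"
```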

Again, this is an example of a simplified tree.  A more realistic scenario looks a bit like another example from my thesis:

(own_prev x (opp_ever_def (opp_prev x (own_prev2
x d)) (own_prev (own_prev2 (opp_ever_def
(opp_prev x (own_prev2 (own_prev2 x c) d))
(opp_ever_def d (opp_prev2 (good_res x x)
c))) d) (opp_ever_def (opp_ever_def (own_prev
x x) (good_res (opp_prev d c) (own_prev (opp_ever_def
(opp_prev x (own_prev2 (opp_ever_def (good_res
(opp_prev2 (good_res x x) c) x) (opp_prev2
(own_prev2 x c) c)) d)) (opp_ever_def (own_prev
x x) (own_prev2 (good_res (own_prev x x)
(own_prev (opp_ever_def (own_prev2 x c) (good_res
(opp_ever_def (good_res d d) (own_prev2 x
c)) x)) (good_res x x))) c))) (good_res x
(own_prev (own_prev x x) (opp_ever_def (opp_prev
x (own_prev2 (own_prev (good_res (good_res
d d) (opp_ever_def (good_res d d) (own_prev2
x c))) (opp_prev2 (good_res x x) c)) d))
(own_prev2 x c))))))) (own_prev2 x c)))))

Nope, not going to draw a graph for that one…  It simplifies down to the following:

(own_prev x (opp_ever_def (opp_prev x (own_prev2 x d)) (own_prev2 x c)))

If you go through the program, you’ll see areas that we can simplify:  contradictory code branches (e.g., if you defected last turn and if you did not defect last turn), redundant results (if you defected last turn, then defect this turn; otherwise, defect this turn), and the like.
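Here is a sketch of the "redundant results" case in plain R (a hypothetical toy, not code from the thesis): when both branches of a conditional return the same decision, the conditional itself can be removed.

```r
# Before simplification: both branches defect, so the test is pointless.
agent <- function(defected_last_turn) {
  if (defected_last_turn) "defect" else "defect"
}

# After simplification: the same behavior with no conditional at all.
simplified_agent <- function(defected_last_turn) {
  "defect"
}

identical(agent(TRUE), simplified_agent(TRUE))    # TRUE
identical(agent(FALSE), simplified_agent(FALSE))  # TRUE
```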

Detailing The Operation

So we’ve looked at mathematical functions and conditional programs, but we haven’t quite described the mechanics behind how genetic programs form new candidate solutions over the generations.  It’s easiest to think about this in graph mode, so let’s start with a pair of candidate solutions.

How I met your generational antecedent.

Let’s suppose we want to combine these two programs.  What we would do is find a subtree and perform a lop-and-splice technique, which is totally different from the chop-and-swap of genetic algorithms.  Our job is to cut off a subtree from each of the two parents and splice the new subtrees in.

Don’t worry, I’m a trained professional. I hardly ever lop off the wrong branch.

In this case, we’re swapping the right subtree under x on the left-hand side with the left subtree under + on the right-hand side.  The subtrees do not need to be the same size in order to swap.  The subtrees do not need to be at the same level in order to swap.  And the subtrees do not even need to contain non-leaf nodes.

Once we’re done, it’s time to splice the subtrees, creating two brand new programs.

Just a little bit of duct tape and you’re good as new.

We now have two new programs, which means crossover is complete.  Mutation is similar to the genetic algorithms example.  We can mutate non-leaf nodes as well as leaf nodes, but we need to follow the rules of what’s allowed to be where.

Something in the water keeps causing all of these mutations.

The left-hand program experienced two mutations, whereas the right-hand tree experienced none.  Now we have two new programs and can continue the process.  Similar to genetic algorithms, we can run genetic programs for a certain amount of time, a certain number of generations, a certain number of operations, or until we get the result we’re expecting.
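The lop-and-splice and mutation steps above can be sketched on nested-list trees.  Again, this representation is my own illustration rather than a library API, and a real implementation would pick the crossover and mutation points at random:

```r
# Two parent programs as nested lists: operators on non-leaf nodes,
# digits on leaf nodes.
parent1 <- list(op = "*", left = 8, right = list(op = "+", left = 3, right = 4))
parent2 <- list(op = "+", left = list(op = "-", left = 9, right = 2), right = 5)

# Lop a subtree off each parent and splice it into the other.  Here we
# always swap parent1's right subtree with parent2's left subtree for
# simplicity; the subtrees need not be the same size or at the same level.
crossover <- function(p1, p2) {
  child1 <- p1
  child2 <- p2
  child1$right <- p2$left
  child2$left <- p1$right
  list(child1, child2)
}

# Point mutation on a leaf: replace a digit with another digit, which
# respects the rule that only numbers may occupy leaf nodes.
mutate_leaf <- function(tree) {
  tree$left <- sample(0:9, 1)
  tree
}

children <- crossover(parent1, parent2)
# children[[1]] now represents 8 * (9 - 2)
# children[[2]] now represents (3 + 4) + 5
```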

Conclusion

This was a cursory introduction to genetic programming.  Next, I’m going to showcase a couple examples of genetic programming in R.

Plot SQL Server CPU Usage With R

EDIT 2017-06-16: I revised the script a little bit to add in total CPU as well, making it easier to see when you’re at 100% utilization.

This is a quick way of grabbing the last 256 minutes of CPU usage data from SQL Server and plotting it using R.

The SQL Server script comes from Glenn Berry’s outstanding set of DMV queries.  Here’s the R code:

if(!require(RODBC)) {
    install.packages("RODBC")
    library(RODBC)
}

if(!require(ggplot2)) {
    install.packages("ggplot2")
    library(ggplot2)
}

if(!require(tidyverse)) {
    install.packages("tidyverse")
    library(tidyverse)
}

# Fill in your own connection string details here
conn <- odbcDriverConnect("Driver=SQL Server;Server=RDU-SQL-R-01.auctionrover.com;Initial Catalog=DemandForecaster;Provider=SQLNCLI11.1;Integrated Security=SSPI")

# This is Glenn Berry's query to get CPU percentage data for the last 256 minutes.
# https://www.sqlskills.com/blogs/glenn/category/dmv-queries/
cpu_query <- paste("DECLARE @ts_now BIGINT = (SELECT cpu_ticks / (cpu_ticks / ms_ticks) FROM sys.dm_os_sys_info WITH (NOLOCK) );

SELECT TOP (256)
	SQLProcessUtilization AS SQLCPU,
	SystemIdle AS Idle,
	100 - SystemIdle - SQLProcessUtilization AS OtherCPU,
	DATEADD(ms, -1 * (@ts_now - [TIMESTAMP]), GETDATE()) AS EventTime
FROM (
		SELECT
			record.value('(./Record/@id)[1]', 'int') AS record_id,
			record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')
			AS [SystemIdle],
			record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int')
			AS [SQLProcessUtilization],
			[TIMESTAMP]
		FROM (
			SELECT
				[TIMESTAMP],
				CONVERT(XML, record) AS [record]
			FROM sys.dm_os_ring_buffers WITH (NOLOCK)
			WHERE
				ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
				AND record LIKE N'%<SystemHealth>%'
		) AS X
	) AS Y
ORDER BY
	record_id DESC OPTION (RECOMPILE);")

cpu_usage_by_minute <- sqlQuery(conn, cpu_query)

cpu_usage_by_minute$TotalCPU <- cpu_usage_by_minute$SQLCPU + cpu_usage_by_minute$OtherCPU
cpu_usage_by_minute$EventTime <- lubridate::ymd_hms(cpu_usage_by_minute$EventTime)
cpu_usage_by_minute_plot_df <- cpu_usage_by_minute %>%
    gather(key = CPUType, value = PercentUse, SQLCPU, OtherCPU, TotalCPU) %>%
    select(EventTime, CPUType, PercentUse)

ggplot(cpu_usage_by_minute_plot_df, aes(x = EventTime, y = PercentUse, color = CPUType)) +
    geom_line(stat='identity')

close(conn)

At the top of the script, we have a construct that loads a package if it is already installed; if it isn’t, we install and then load it.  The packages we will use today are RODBC (to connect to SQL Server), ggplot2 (to visualize the data), and tidyverse (to structure the data in a nicer format).  Here is an example of the script in action:

On my instance, I have SQL Server R Services installed and am hammering away at the box with it, training a bunch of neural nets concurrently.  We can see that I got started with this sometime around 10:30 AM, stopped it for a few minutes, and then went nuts around 10:45 AM.
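If you use the load-or-install pattern from the top of the script often, you could wrap it in a small helper function.  This helper is a hypothetical convenience of my own, not part of any of these packages:

```r
# Load a package, installing it first if it isn't already present.
load_or_install <- function(pkg) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}

# Equivalent to the three if(!require(...)) blocks above:
# load_or_install("RODBC")
# load_or_install("ggplot2")
# load_or_install("tidyverse")
```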

Genetic Algorithms In R: The Holyfield Problem

This is part of a series entitled Genetics In Action.

In the last post, we looked at solving a basic genetic algorithms problem:  finding a global maximum for a function consisting of a single real-valued variable.  Today, we are going to look at a more complex interaction and solve one of life’s more interesting challenges:  the Holyfield Problem, which is really a knapsack problem in disguise.

The Holyfield Problem

The Holyfield Problem stems from the classic Sega Genesis video game, Evander Holyfield’s Real Deal Boxing.

Evander Holyfield’s Real Deal Boxing spurred a number of PhD theses back in its day.

The purpose of this game was as complicated as it was challenging:  to make the best boxer there ever was and defeat your father figure in the ring.  There’s a Freudian element in there I’m not particularly comfortable with, but I don’t get to choose the plots of video games.

Anyhow, as mentioned, you want to be the best there is, and the only way to be the best is to train the best.  That’s why the game offers you a number of training opportunities, each of which builds up at least one of the four core skills necessary to punch large men repeatedly in a ring.

An artist’s rendering of the author, shortly after the author slipped the artist a $50.

Above, we see eight of the options available, and of the three I’ve selected (punching bag, jump rope, and loose weights), what they would do for my namesake.  All in all there are 19 training options available, and figuring out the best combination sounds like a job for a genetic algorithm!

Defining The Problem

Let’s take a moment and build out a proper definition of the problem.  As mentioned, we have 19 potential training options available to us.  Each training option has some positive effect on at least one of four stats:  power, stamina, speed, and defense.  Each training option also comes at a cost, and we have a fixed budget.  Our goal is to maximize a fighter fitness function (which I’ll describe below) contingent upon remaining at or below the total allowable cost.

Note that the search space here is the product set of alleles.  We have 19 genes, each of which has 2 alleles, so it’s 2^19.  That means that there are 524,288 potential training regimens.  Many of these will be over cost, but that still leaves a large number to sift through.
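The search-space arithmetic is quick to verify in R:

```r
n_genes <- 19
n_alleles_per_gene <- 2

n_regimens <- n_alleles_per_gene^n_genes
n_regimens  # 524288 potential training regimens
```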

Building The Chromosome

Each organism in this example will have a single chromosome with 19 genes, representing each of the 19 training options.  Because we select training options without replacement, the best representation for each gene is a binary set of alleles:  either we have selected the training option or we have not selected it; we cannot select it twice.

Let’s start writing some code:

if(!require(GA)) {
    install.packages("GA", repos = "http://cran.us.r-project.org")
    library(GA)
}

if(!require(ggplot2)) {
    install.packages("ggplot2", repos = "http://cran.us.r-project.org")
    library(ggplot2)
}

training_options <- data.frame(
  item = c("Exercycle", "Head Guard", "Health Club", "Iron Gum Shield", "Jog Machine", "Jump Rope", "Karate",
           "Loose Weights", "Multi Gym", "Power Gloves", "Protein Diet", "Punch Bag", "Running Shoes", "Sparring",
           "Speed Bag", "Speed Boots", "Step-O-Matic", "Track Work", "Vitamins"), 
  stamina = c(2, 0, 1, 0, 2, 1, 0, 0, 1, 0, 1, 2, 2, 0, 1, 0, 2, 1, 1),
  speed =   c(1, 0, 1, 0, 1, 1, 2, 0, 1, 0, 1, 0, 1, 0, 1, 4, 0, 2, 1),
  power =   c(0, 0, 1, 0, 0, 0, 0, 2, 1, 3, 1, 2, 0, 0, 0, 0, 2, 0, 1),
  defense = c(0, 2, 1, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0),
  cost =    c(2, 2, 3, 1, 3, 1, 2, 1, 3, 3, 2, 2, 1, 3, 2, 3, 3, 2, 2))
training_balance <- 8

We have a total training budget of 8 and can see the training options available.  Let’s view this in a table format, as that’s easier to read:

item             stamina  speed  power  defense  cost
Exercycle              2      1      0        0     2
Head Guard             0      0      0        2     2
Health Club            1      1      1        1     3
Iron Gum Shield        0      0      0        3     1
Jog Machine            2      1      0        0     3
Jump Rope              1      1      0        0     1
Karate                 0      2      0        3     2
Loose Weights          0      0      2        0     1
Multi Gym              1      1      1        0     3
Power Gloves           0      0      3        0     3
Protein Diet           1      1      1        0     2
Punch Bag              2      0      2        0     2
Running Shoes          2      1      0        0     1
Sparring               0      0      0        4     3
Speed Bag              1      1      0        0     2
Speed Boots            0      4      0        0     3
Step-O-Matic           2      0      2        0     3
Track Work             1      2      0        0     2
Vitamins               1      1      1        0     2

Each option has a cost between 1 and 3 points and generates at least 1 point of skill.  Here’s a sample chromosome, where I selected a few options:

sample_chromosome = c(1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
training_options[sample_chromosome == 1, ]

The sample_chromosome is a vector of options, and we feed those sample chromosome values into the training_options data frame to extract only the options I selected:

   item             stamina  speed  power  defense  cost
1  Exercycle              2      1      0        0     2
4  Iron Gum Shield        0      0      0        3     1
5  Jog Machine            2      1      0        0     3
11 Protein Diet           1      1      1        0     2

This is a valid set of options, in that its cost is within our training budget.  How good is it, though?  Well, in order to figure that out, we need to define a fitness function.

We Pump And We Pump

Our fitness function will explain to us just how good each candidate solution is.

EvaluateBoxerTraining <- function(chromosome) {
  # Each of the training stats follows a log form, so a balanced boxer is better.
  # Then, we'll emphasize speed over power over defense over stamina because reasons.
  eval_speed <- 2.2 * log2(chromosome %*% training_options$speed)
  eval_power <- 1.9 * log2(chromosome %*% training_options$power)
  eval_defense <- 1.6 * log2(chromosome %*% training_options$defense)
  eval_stamina <- 1.3 * log2(chromosome %*% training_options$stamina)
  
  eval_score <- eval_speed + eval_power + eval_defense + eval_stamina
  eval_cost <- chromosome %*% training_options$cost

  if (eval_cost > training_balance) {
    return(0)
  } else {
    return(eval_score)
  }
}

Let’s walk through this for a moment. Starting with speed, we can see that there’s a function to determine the speed score. That function takes the base-2 logarithm (log2) of chromosome %*% training_options$speed. This is matrix multiplication of the chromosome vector times the speed vector in the training_options data frame. Going back to linear algebra, multiplying a 1xN matrix by an Nx1 matrix gives us a 1x1 result. Because our chromosome is made up of 0s and 1s, what we end up doing is adding up the speed points of the options we’ve selected, so that’d be 3 in our sample chromosome above (1x1 + 1x0 + 1x1 + 1x1). As another example, for defense, the calculation is the same: (1x0 + 1x3 + 1x0 + 1x0 = 3). This is a nice way of eliminating an unnecessary for loop in R, and when you’re doing these calculations thousands of times, it can shave off enough time to make this worth knowing.

So we perform matrix multiplication to get the sum of our speed bonuses, which is 3. Then we take the base-2 log of 3 (which is roughly 1.585) and multiply it by 2.2, giving us a speed score of about 3.487.
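We can check this equivalence directly.  The vectors below are copied from the sample chromosome and the speed column of training_options:

```r
chromosome <- c(1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
speed      <- c(1, 0, 1, 0, 1, 1, 2, 0, 1, 0, 1, 0, 1, 0, 1, 4, 0, 2, 1)

# Matrix multiplication of a 1x19 row by a 19x1 column yields a 1x1 result...
as.numeric(chromosome %*% speed)  # 3

# ...which is the same as the loop-free elementwise sum.
sum(chromosome * speed)           # 3

# And the resulting speed score for the sample chromosome:
2.2 * log2(as.numeric(chromosome %*% speed))  # ~3.487
```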

Our fitness function is the summation of these four scores: the scores for speed, power, defense, and stamina. If we go over the training balance, we return 0 for the fitness score, as it’s an invalid combination; otherwise, we return the fitness score we calculated.

As a quick aside, the reason I take the base-2 log of each of these sums is that it gives us a nice property in R: if, say, the speed value is 0, then log2(0) returns -Inf, which we can guarantee will not be the best answer. This requires our boxers to have a minimum value in all four attributes.
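This behavior is easy to confirm at the console:

```r
log2(0)  # -Inf

# -Inf propagates through the arithmetic, so one zeroed-out attribute
# sinks the entire fitness score.
2.2 * log2(0) + 1.9 * log2(4)  # still -Inf
```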

Looking at this fitness function, it seems that we want ceteris paribus to emphasize speed and power over defense and stamina. Speed and power give us better bonuses, so picking options like speed boots and power gloves would seem to be winners.

Going back to our sample, if I evaluate the sample’s fitness using EvaluateBoxerTraining(sample_chromosome), I get back a score of 9.041364. Is that a good score or a bad score? Who knows? We’ll have to test it against other scores to figure out if we’ve accidentally-not-accidentally picked the best set of training options.

Starting A Boxing School

Now we want to build up our genetic algorithm and put tens of thousands of boxers to the test.  Specifically, I am going to look at a population size of 160 candidate solutions over 100 generations, for a total of up to 16,000 searches—so if we looked at a random search mechanism with the same number of steps, we’d hit just over 3% of the total search space.
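That coverage figure comes straight from the arithmetic:

```r
pop_size <- 160
generations <- 100
search_space <- 2^19

(pop_size * generations) / search_space  # ~0.0305, just over 3% of the space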

Our model is simple, though there are a couple of differences from the first genetic algorithm we saw.

model <- ga(type = "binary", fitness = EvaluateBoxerTraining, popSize = 160,
            run = 50, nBits = 19, pcrossover = 0.8, pmutation = 0.2)
summary(model)

We are using binary genes, so the type is set to binary rather than real-valued. We have defined our fitness function and set a few parameters. Our population size is 160 boxers, and each boxer has a chromosome with 19 bits. Our crossover probability is 80% and our mutation probability is 20%. Finally, we will stop the simulation early if the most-fit boxer has not improved over 50 consecutive generations. This gives us a short-circuit opportunity in case we find the best answer early on.

Viewing the summary, we end up with a result something like this:

+-----------------------------------+
|         Genetic Algorithm         |
+-----------------------------------+

GA settings: 
Type                  =  binary 
Population size       =  160 
Number of generations =  100 
Elitism               =  8 
Crossover probability =  0.8 
Mutation probability  =  0.2 

GA results: 
Iterations             = 69 
Fitness function value = 15.35445 
Solution = 
     x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19
[1,]  0  0  0  1  0  1  1  1  0   0   0   1   1   0   0   0   0   0   0

We went a total of 69 generations and picked a training set which gives us a fitness score of 15.35445, way better than the 9.041364 we had above. We can look at the solution to see exactly which options the algorithm chose:

solution <- model@solution
solution_options <- training_options[solution == 1, ]
solution_options
   item             stamina  speed  power  defense  cost
4  Iron Gum Shield        0      0      0        3     1
6  Jump Rope              1      1      0        0     1
7  Karate                 0      2      0        3     2
8  Loose Weights          0      0      2        0     1
12 Punch Bag              2      0      2        0     2
13 Running Shoes          2      1      0        0     1

Neither of our big speed/power boosts shows up.  Furthermore, using colSums(solution_options[2:5]) to sum up each of the four stat columns, the results are surprising:

stamina   speed   power defense 
      5       4       4       6 

Instead of skewing toward speed and power, the best solution actually skews toward defense and stamina!

Now that we’ve found the best solution, let’s look at the evolution of our solution space over time:

“You’re a bum, you’ve always been a bum, and you’ll never be anything but a bum!” The GA coach is a mean coach.

Notice that there are several stair-step increases in the genetic algorithm results.  In the first few generations, the best score was a 0.  Eventually, we jump up to 10, and then a few jaunts up to the 13-14 range before settling on our final score of 15.35445.  It took more than 35 generations for the median to reach our peak fitness, and notice the erratic behavior of the mean:  that’s because we have cases with -Inf results dragging down the mean.  But just like before, we’re not concerned with the specific values of anything except the best-fit result, so that’s fine.

Conclusion

This was a more complicated genetic algorithm, but it shows a classic example of how to solve a problem.  We need to be able to define the problem in terms of genes, alleles, and a fitness function.  If we can do that, then we will be able to feed those into the genetic algorithm.  Many problems are not as easy to fit into a genetic algorithm as our problem was, but if you’re able to define the problem in these terms, you can solve a complicated optimization problem while looking at a tiny fraction of the entire search space.

One nice property that I did not emphasize in the above example is that genetic algorithms don’t work on heuristics like we do.  In my explanation, I gave what I thought would be the best solution heuristic:  look for high-speed and high-power training options and focus on them.  It turns out that this was not in fact a winning strategy.  We don’t really have a way of telling the genetic algorithm to look in a certain part of the search space (not directly, at least), and that’s a good thing.  Although you might think that narrowing the search space would help the algorithm, what it really does is make it more likely that we find a local maximum instead of the global maximum.

Next up on our agenda is genetic programming.

Using Genetic Algorithms In R

This is part of a series entitled Genetics In Action.

In today’s post, we are going to look at the GA package in R, which implements genetic algorithms.  This package allows us to build different types of genetic algorithms in just a few lines of code.  Let’s start with an easy example.

Maximizing A Mathematical Function

Our first example involves finding the global maximum for a mathematical function.  Here’s the setup code:

if(!require(GA)) {
    install.packages("GA", repos = "http://cran.us.r-project.org")
    library(GA)
}

f <- function(x) { (x^2+x)*cos(x) }
min <- -10; max <- 10

The first bit loads the GA library if it does not exist. Then, we generate a fitness function called f. This fitness function tells us how well we’re doing for any value of x. We also have a constraint: x must be in the range [ -10, 10 ].

Let’s view the curve that this fitness function generates by using the curve function like so: curve(f, min, max, n = 1000).

Child, this is the shape of our domain.  We own everything the eye can see, at least until you get to double digits.

Our goal is to find the value x which maximizes f(x) over the relevant range.  Eyeballing the problem, we can see that the maximum value is somewhere around 6.5, but note that there’s a local maximum somewhere around -6.5.  Many hill-climbing algorithms will get stuck around -6.5 if you start out near that point, so we want to make sure that our genetic algorithm doesn’t get stuck there.
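We can sanity-check those eyeballed locations with base R.  The specific x values below are my approximations, not output from the GA package:

```r
f <- function(x) { (x^2 + x) * cos(x) }

f(6.56)   # ~47.7: the neighborhood of the global maximum
f(-6.56)  # ~35.1: the local maximum where hill-climbers can get stuck

# optimize() performs a local, hill-climbing-style search; restricted to
# the negative half of the domain, it happily reports the local maximum.
optimize(f, interval = c(-10, 0), maximum = TRUE)
```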

To build the genetic algorithm and run its test, call the ga function:

GA <- ga(type = "real-valued", fitness = f, min = min, max = max, monitor = FALSE)
summary(GA)

We selected a type of real-valued, which means that our genes are numeric values with decimals. Our fitness function is the function f, which we defined above, and the constraint is that we must keep our genes between min and max. After the algorithm runs, we get back a summary:

+-----------------------------------+
|         Genetic Algorithm         |
+-----------------------------------+

GA settings: 
Type                  =  real-valued 
Population size       =  50 
Number of generations =  100 
Elitism               =  2 
Crossover probability =  0.8 
Mutation probability  =  0.1 
Search domain = 
     x1
Min -10
Max  10

GA results: 
Iterations             = 100 
Fitness function value = 47.70562 
Solution = 
           x1
[1,] 6.560509

This tells us all about the genetic algorithm. We can see that the defaults are 50 population members, 100 generations, a crossover probability of 80%, and a mutation probability of 10%. We can see that we went through 100 iterations and found that the point 6.56 has a fitness of 47.70562, which is the highest possible score.

If you want to visualize what the algorithm did over time, calling plot(GA) will generate a plot for you:

Plotting the genetic algorithm’s evolution.

There are three things measured here:  the best result, the mean result, and the median result.  In this case, we ended up hitting the best result in generation 1, and that best result stuck around until the end of the simulation.  Normally, we don’t hit the global maximum right from the beginning, but we got lucky this time.  Notice that the median population fitness score eventually converges on the global maximum, but the mean fitness score never really makes it that high.  This is actually a good thing for us:  it shows that at 100 generations, there is still enough genetic diversity that we are building population members which search across the valid space just in case our current best isn’t really the best.  In a non-trivial problem, it could take dozens or hundreds of generations to converge, and if you run out of viable alternatives too early, you’ll never get past a local maximum.  But once the simulation is over, we really only care about the single most fit solution.

Let’s plot our best solution now:

curve(f, min, max, n = 1000)
points(GA@solution, GA@fitnessValue, col = 2, pch = 19)

Our best solution is the global maximum.

We were able to find the global maximum, just as we wanted.

Conclusion

In this post, we were able to use a genetic algorithm to solve a single-variable maximization problem.  It’s not the most exciting problem to solve, but it does give us an introduction into the mechanics of the GA package.  In the next post, we will look at another classic optimization problem.