*This is part of a series entitled Genetics In Action.*

In today’s post, I am going to look at two examples of deriving a mathematical function. I will use the rgp package in R. It appears that nobody is maintaining the rgp package anymore, so this isn’t something I’d use in a production scenario, but it does offer us a fairly easy introduction.

### Tell Me What I Know

In this first example, I will look at a case where we hope to see the genetic program approximate a known function as closely as possible. You can think of this as a supervised test, where we have a desired end state and want to see how well the genetic program’s results fare in comparison.

#### A Craftsman Never Blames His Tools

if(!require(rgp)) {
install.packages("rgp", repos = "http://cran.us.r-project.org")
library(rgp)
}
function_set <- functionSet("+", "*", "-")
input_variable_set <- inputVariableSet("x")
constant_factory_set <- constantFactorySet(function() { rnorm(1) })

In this first bit of code, we load the rgp package and set up some of our constraints. I want to derive a mathematical formula using only the addition, multiplication, and subtraction operators. As a note, division can be troublesome in a genetic program because of the chance of dividing by zero.

Aside from mathematical operators, we have one independent variable in the mix: x. Now we need to figure out our fitness function.

function_interval <- seq(from = -pi, to = pi, by = 0.1)
fitness_function <- function(f) { rmse(f(function_interval), sin(function_interval)) }

Our end goal is a sine wave that runs from -pi to pi. Our fitness function is the Root Mean Square Error of our generated function versus the ideal of the sine wave, meaning that we want to **minimize** the result of our fitness function, with a value of 0 matching the sine wave exactly. Now that we have everything in place, let’s kick off the process.

result <- geneticProgramming(
functionSet = function_set,
inputVariables = input_variable_set,
constantSet = constant_factory_set,
fitnessFunction = fitness_function,
stopCondition = makeTimeStopCondition(20) #run for 20 seconds
)

This will run for 20 seconds, during which time you will hopefully find functions which converge closer and closer to the ideal. In my case, it looked like this:

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 0.706164, time elapsed: 0.84 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 0.445428, time elapsed: 1.4 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 0.445256, time elapsed: 2.42 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 0.445256, time elapsed: 3.53 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 0.445256, time elapsed: 4.32 seconds
evolution step 600, fitness evaluations: 29950, best fitness: 0.445256, time elapsed: 5.26 seconds
evolution step 700, fitness evaluations: 34950, best fitness: 0.445256, time elapsed: 6.15 seconds
evolution step 800, fitness evaluations: 39950, best fitness: 0.445256, time elapsed: 6.75 seconds
evolution step 900, fitness evaluations: 44950, best fitness: 0.445210, time elapsed: 7.59 seconds
evolution step 1000, fitness evaluations: 49950, best fitness: 0.445210, time elapsed: 8.29 seconds
evolution step 1100, fitness evaluations: 54950, best fitness: 0.445210, time elapsed: 8.9 seconds
evolution step 1200, fitness evaluations: 59950, best fitness: 0.445210, time elapsed: 9.81 seconds
evolution step 1300, fitness evaluations: 64950, best fitness: 0.445210, time elapsed: 12.35 seconds
evolution step 1400, fitness evaluations: 69950, best fitness: 0.445210, time elapsed: 13.09 seconds
evolution step 1500, fitness evaluations: 74950, best fitness: 0.445210, time elapsed: 14.81 seconds
evolution step 1600, fitness evaluations: 79950, best fitness: 0.445198, time elapsed: 15.48 seconds
evolution step 1700, fitness evaluations: 84950, best fitness: 0.445198, time elapsed: 16.17 seconds
evolution step 1800, fitness evaluations: 89950, best fitness: 0.445198, time elapsed: 17.29 seconds
evolution step 1900, fitness evaluations: 94950, best fitness: 0.445198, time elapsed: 18.01 seconds
evolution step 2000, fitness evaluations: 99950, best fitness: 0.445198, time elapsed: 18.68 seconds
evolution step 2100, fitness evaluations: 104950, best fitness: 0.445198, time elapsed: 19.31 seconds
Genetic programming evolution run FINISHED after 2138 evolution steps, 106850 fitness evaluations and 20 seconds.

There was some convergence, but it wasn’t the best. Let’s take a look at the best program:

best_solution <- result$population[[which.min(result$fitnessValues)]]
best_solution
min(result$fitnessValues)

That spits back the following:

function (x)
0.305371214275836 * (x - -0.0445197474230304)
0.445198186160344

The reduced form function is 0.305x + 0.013595. That’s a linear function with respect to x, and the RMSE is 0.445, which is not great.

Here’s a plot of our solution versus the ideal solution:

plot(y = best_solution(function_interval), x = function_interval, type = "l", lty = 1, xlab = "x", ylab = "y")
lines(y = sin(function_interval), x = function_interval, lty = 2)

Genetic program with a one-track mind.

#### A Craftsman Goes Out And Gets Better Tools

That first result wasn’t very good, and part of the problem is that we did not give the genetic program the right tools to figure out a solution to our problem. So let’s change the function set just a little bit:

function_set <- functionSet("+", "*", "-", "cos")

You and I know that cosine and sine have a direct relationship: shift sine by pi/2 and you get cosine. Let’s see if the genetic program figures out the same. This time around, I ran the function for 7 seconds, and we can see the improvement:

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 0.458847, time elapsed: 0.64 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 0.445672, time elapsed: 1.32 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 0.124575, time elapsed: 2 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 0.003608, time elapsed: 2.64 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 0.003608, time elapsed: 4.41 seconds
evolution step 600, fitness evaluations: 29950, best fitness: 0.003608, time elapsed: 5.22 seconds
evolution step 700, fitness evaluations: 34950, best fitness: 0.003608, time elapsed: 5.97 seconds
evolution step 800, fitness evaluations: 39950, best fitness: 0.003608, time elapsed: 6.71 seconds
Genetic programming evolution run FINISHED after 840 evolution steps, 41950 fitness evaluations and 7 seconds.

For the first couple seconds, the algorithm wasn’t doing much better than it had last time around, but then something clicked and the best solution got a lot better. Let’s take a look at it.

best_solution <- result$population[[which.min(result$fitnessValues)]]
best_solution
min(result$fitnessValues)

function (x)
cos(x - 0.291574502599581 - 1.2741260116152)
0.00360807823117699

In other words, this is cos(x – 1.5657), which is **very** close to pi/2. If we plot the curve, we’ll see an almost-exact match.

plot(y = best_solution(function_interval), x = function_interval, type = "l", lty = 1, xlab = "x", ylab = "y")
lines(y = sin(function_interval), x = function_interval, lty = 2)

Turns out cosine and sine are actually related.

#### Finding The Best Tools

We can take this one step further and have the function set include all basic math + trigonometric functions like so:

function_set <- trigonometricFunctionSet

In rgp, this includes the set { sin, cos, tan }. When we run the function with this set, it finds the best solution almost immediately: sin(x).

### Tell Me What I Don’t Know

Now we’re going to shake things up a bit. Instead of trying to match a function, we will take a sample of data collected in the wild and attempt to fit a function to it.

if(!require(rgp)) {
install.packages("rgp", repos = "http://cran.us.r-project.org")
library(rgp)
}
if(!require(igraph)) {
install.packages("igraph", repos = "http://cran.us.r-project.org")
library(igraph)
}

We will still use rgp, but I’m also going to introduce igraph, which allows us to visualize graphs easily.

Unlike before, we don’t have a function in mind; rather, we have some x-y coordinates.

input_data <- data.frame(
x = c(1, 2, 3, 4, 5, 6, 7),
y = c(0, 26, 88, 186, 320, 490, 541)
)

If we use `plot(x,y)`

to plot this, we get the following image:

Did you know that disco record sales were up 400% for the year ending 1976? If these trends continue…AY!

Let’s turn this set of numbers into a plausible example. Let’s say we have a rocket car we’re testing, and the y axis represents speed in miles per hour, whereas the x axis represents seconds. What we see is an immediate burst and exponential growth for the first few seconds, but then we hit an inflection point and start tailing off near the end. If we drew this out for a few more seconds, we’d probably see the y value flatline somewhere under 600.

Now that we have our just-so story in place, let’s build our process. We will use the basic arithmetic operators and perform a symbolic regression of y against x, running for 500 steps.

function_set <- functionSet("+", "-", "*")
result <- symbolicRegression(
y ~ x,
data = input_data,
functionSet = function_set,
stopCondition = makeStepsStopCondition(500)
)

We could run for a specific period of time as well; rgp has several stop conditions available. Anyhow, here’s how the results look in my particular run:

STARTING genetic programming evolution run (Age/Fitness/Complexity Pareto GP search-heuristic) ...
evolution step 100, fitness evaluations: 4950, best fitness: 33.007936, time elapsed: 3.75 seconds
evolution step 200, fitness evaluations: 9950, best fitness: 33.007936, time elapsed: 7.18 seconds
evolution step 300, fitness evaluations: 14950, best fitness: 33.007936, time elapsed: 10.36 seconds
evolution step 400, fitness evaluations: 19950, best fitness: 33.007936, time elapsed: 13.53 seconds
evolution step 500, fitness evaluations: 24950, best fitness: 33.007936, time elapsed: 16.9 seconds
Genetic programming evolution run FINISHED after 500 evolution steps, 24950 fitness evaluations and 16.9 seconds.

Looks like we hit a pretty decent result early but couldn’t improve on it in the allocated number of operations. Here’s what our result looks like:

plot(input_data$y, col=1, type="l")
points(predict(result, newdata = input_data), col=2, type="l")

No, don’t worry, we can forecast outside of the prediction range no problem!

And here’s the formula that our best result generates:

best_result <- result$population[[which.min(result$fitnessValues)]]
best_result

8.87532839551568 * (-1.28207603423574 * -1.08413319016388) * x * x

Simplifying that, we have approximately 12.336x^2. That’s a pretty good fit for the data set, though it does not capture the falloff near the end and would perform poorly in our out-of-sample data (assuming, again, that subsequent results continue the late trend of trailing off toward a maximum value under 600).

#### 1000 Words Coming Up

I’d like to display the result as a graph, because now that I’m in management, I find myself incapable of understanding anything that does not have a picture associated with it. The rgp package has a nice function called funcToIgraph() built in, which allows you to turn best_result into a graph that igraph can read and display. There’s a bug in rgp, though: when rgp was built, igraph used 0-indexed notation, but they subsequently shifted to 1-based notation. Therefore, rgp has an off-by-one bug. To get around this, I wrote my own function that I can use in this demo:

exprToIgraph <- function(expr) {
if (!require("igraph")) stop("exprToIgraph: Package 'igraph' not installed.")
exprGraph <- exprToGraph(expr)
exprIgraph <- graph(exprGraph$edges, n = length(exprGraph$vertices)) # igraph vertexes used to be counted from zero but no longer are!
V(exprIgraph)$label <- exprGraph$vertices
exprIgraph
}

I’d happily submit it as a pull request, but I could only find mirrors and forks of rgp and not any active repository.

Anyhow, let’s satisfy the manager inside and turn this formula into a picture:

final_graph <- exprToIgraph(body(best_result))
plot(final_graph, layout=layout_as_tree, vertex.color="orange", vertex.size=30, edge.arrow.size=0.1)

Somebody out there is mentally picturing this graph in Reverse Polish Notation.

There we go; now I’m rejuvenated and ready to demand some more TPS reports.

### Conclusion

Today’s post was a look at a pair of genetic programs which derive mathematical functions. This is one of the main uses of genetic programming, but it’s not the only one. In tomorrow’s post, I will look at another major use case.