This is part of a series entitled Genetics In Action.
So far, we have learned a little bit about evolutionary algorithms and taken a look at just enough high school biology to review the basics of genetics. Today, I want to look at one particular type of evolutionary algorithm called a genetic algorithm.
Although John Holland did not invent the concept of genetic algorithms, he is the man most responsible for popularizing and developing the concept. Holland’s Adaptation in Natural and Artificial Systems is a classic of the field and ties together the genetic metaphor. I highly recommend this book if you’re interested in the topic.
Digging Into The Algorithm
In the last post, we looked at how the genetic metaphor ties development concepts to biological concepts, and today we will move beyond that high-level description and cover the specifics of how genetic algorithms work. We will start by looking at the simplest type of genetic algorithm: a single chromosome with a fixed number of genes, each of which has two alleles. In computer science terms, think about a fixed-length array of boolean values.
In the image above, we can see two sample chromosomes, each of which has eight genes. We want to build up a population of these chromosomes at random. One of the interesting parts of the genetic algorithm is that it (usually) doesn’t matter where in the total search space we begin, so starting at a bunch of random points is just as good as anything else.
When it comes to choosing the size of the population, there aren’t too many hard-and-fast rules. I have read recommendations that you should have at least 2 * N organisms, where N is the number of genes that each chromosome has. If you’re looking at 10-30 genes, I’ve typically had good luck with a population size of somewhere between 100 and 500. You’ll find that there is a maximum useful population size, after which you don’t really get any benefit: you won’t converge on a solution any faster, and each generation will take longer to process.
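Here is a minimal sketch of that starting point in Python. The function names, the 50/50 coin flip per gene, and the fixed seed are my own assumptions for illustration, not part of any particular library.

```python
import random

def random_chromosome(num_genes, rng):
    """Build one chromosome: a fixed-length list of booleans (two alleles per gene)."""
    return [rng.random() < 0.5 for _ in range(num_genes)]

def random_population(pop_size, num_genes, rng):
    """Build a population of independently generated random chromosomes."""
    return [random_chromosome(num_genes, rng) for _ in range(pop_size)]

# Assumed values: 8 genes (as in the sample chromosomes) and 100 organisms.
rng = random.Random(42)  # seeded only so the sketch is repeatable
population = random_population(pop_size=100, num_genes=8, rng=rng)
```

Passing in the random number generator rather than using the module-level functions makes it easy to reproduce a run while you are experimenting with population sizes.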
Once we have our population, the next step is to score each organism in the population. To score an organism, we apply a fitness function. In computer science terms, this is usually an actual function, where we use the organism’s chromosomal representation as the inputs and generate and return a score for each chromosome.
In the image above, we have defined a score for each organism. This score is typically one of two things: either it is the distance from an ideal point, or it is a valuation. In the first case, think of a set of (x, y) coordinates. We want to define a chromosome that, given an x coordinate, will generate its appropriate y coordinate. We will calculate some distance between the predicted y and the actual y (for example, we could calculate the root-mean-square deviation), where a perfect match has a deviation score of 0. In that case, a lower number is better for us.
In the second case, suppose that we can produce N separate products. Each product has a cost and a sale price. Our chromosome might describe the combination of goods we create, and the score would be the net margin (revenue minus cost) of those products. In that case, a higher number is better for us.
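Both flavors of fitness function can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names, the idea that each boolean gene means "produce this product or not," and the sample prices and costs.

```python
import math

def net_margin(chromosome, prices, costs):
    """Valuation-style score: total (price - cost) over the products we chose to make.
    Higher is better. Assumes gene i is True when product i is produced."""
    return sum(p - c for gene, p, c in zip(chromosome, prices, costs) if gene)

def rmsd(predicted, actual):
    """Distance-style score: root-mean-square deviation between predicted and actual y.
    Lower is better, and a perfect match scores 0."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical data: three products, of which we make the first and third.
margin = net_margin([True, False, True], prices=[10, 5, 8], costs=[4, 7, 3])
perfect = rmsd([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Note that the two scores point in opposite directions. If you use a selection method that rewards bigger numbers, a distance-style score needs to be inverted or otherwise transformed first.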
It’s A Generational Thing
Now that we have the basic idea of a fitness score behind us, let’s go to the next step: making babies.
Now I am showing the entire population, which has four members. Each member of the population has its own score, and we will use those scores to help us figure out the next generation of organisms. The mechanism I am showing in the picture above is the simplest selection mechanism for genetic algorithms: roulette wheel selection. Basically, add up the fitness values for each member of the population to get a total score; that’s the 508 above. Divide each member’s score by that total, and you have a set of proportions which sum to 1. Pick a random number between 0 and 1, and wherever you land on the cumulative distribution function, you have your choice of parent. Note that you draw with replacement, meaning that you can pull the same organism more than once.
To make this concrete, let’s suppose that red owns numbers from 0 up to .335, blue owns numbers from .335 up to .587, green owns .587 up to .815, and yellow owns .815 through 1.0. Our first random number drawn is .008, so the lucky winner is red. Then, we draw for a second parent and pull .661, which happens to be squarely in green territory. We now have our two parents.
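The wheel described above can be sketched in Python as follows. The individual scores (170, 128, 116, 94) are my assumption: they are values that sum to 508 and reproduce the boundaries quoted above, not numbers taken from the picture. The function name is also hypothetical, and the sketch assumes non-negative scores where higher is better.

```python
import bisect
import itertools

def roulette_pick(scores, draw):
    """Return the index of the organism whose slice of the wheel contains `draw`.
    `draw` is a number in [0, 1); scores must be non-negative with a positive total."""
    total = sum(scores)
    # Running total of each organism's share, e.g. [0.335, 0.587, 0.815, 1.0].
    cumulative = list(itertools.accumulate(s / total for s in scores))
    index = bisect.bisect_right(cumulative, draw)
    return min(index, len(scores) - 1)  # guard against float rounding at the top end

names = ["red", "blue", "green", "yellow"]
scores = [170, 128, 116, 94]  # assumed scores, chosen to total 508

first_parent = names[roulette_pick(scores, 0.008)]
second_parent = names[roulette_pick(scores, 0.661)]
```

In a real run, `draw` would come from a random number generator, and each call is independent, which is what drawing with replacement means here.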
Now that we have our two parents, we are going to generate two children. I need to introduce a new concept: crossover.
Crossover is the recombination of a segment of a chromosome. In the example above, we are switching genes 3-5 from each of the parents for each of the children (though child #2 is shy and is hiding off-camera).
This action is part of the genius behind genetic algorithms. We’ve already taken some of the fittest organisms: in a large population, we’re going to pull successful organisms much more frequently than unfit organisms, and there are other selection techniques which bias toward successful chromosomes even more than roulette wheel does. By recombining slices of their genes, we are able to test the waters with new combinations to see if we can find something even more successful.
Of course, there’s no guarantee that the new combination will be more successful than its parents were, so we have a concept known as the crossover percentage. That is, we only perform crossover a certain percentage of the time. In practice, this is often anywhere from 60% to 90%. If we don’t perform crossover, then the two chromosomes just mosey on into the next step of the process. But if we do roll the dice and land on the ol’ chop-and-swap, then we have two more RNG rounds to play.
The first of these bonus random number pulls determines where we start the chop, and the second pull determines where we stop. In the picture above, we start at gene 3 and end at gene 5, inclusive. In genetic algorithms, we typically have fixed-size chromosomes (though that’s not always true!) and therefore symmetrical swaps.
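Here is a rough Python sketch of the crossover step as described. The function names and the default 70% crossover rate are assumptions for illustration, and positions are 0-based, so genes 3-5 in the picture correspond to indexes 2 through 4.

```python
import random

def swap_segment(parent_a, parent_b, start, stop):
    """Swap genes start..stop (0-based, inclusive) between two equal-length parents."""
    child_a = parent_a[:start] + parent_b[start:stop + 1] + parent_a[stop + 1:]
    child_b = parent_b[:start] + parent_a[start:stop + 1] + parent_b[stop + 1:]
    return child_a, child_b

def crossover(parent_a, parent_b, rng, crossover_rate=0.7):
    """With probability crossover_rate, swap a random segment; otherwise copy the parents."""
    if rng.random() >= crossover_rate:
        return parent_a[:], parent_b[:]
    # Two extra random pulls: where the chop starts and where it stops.
    start, stop = sorted(rng.randrange(len(parent_a)) for _ in range(2))
    return swap_segment(parent_a, parent_b, start, stop)

# Genes 3-5 (1-based) swapped between an all-True and an all-False parent.
child_1, child_2 = swap_segment([True] * 8, [False] * 8, start=2, stop=4)
```

Because both parents have the same length and we swap the same positions in each, the children always come out the same length as their parents.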
The last step in our genetic factory floor is mutation. One problem with survival-of-the-fittest is that, especially in later generations, we might run low on genetic diversity. At an extreme, we end up with a population full of exactly the same chromosome, so no matter how you slice them, you get no novel patterns. If we’ve reached the global maximum, that’s acceptable, but what if we ended up at only a local maximum and can’t jump over to the global max? That’s where mutation comes into play.
Mutation works by modifying a particular gene’s value; with only two alleles, that means flipping it. For each gene in each new chromosome, mutate with a probability p. Usually p is somewhere between 0.001% and 1%, though I’ve read papers that argue that, in certain circumstances, you might need a mutation rate of 20% to 30%. Those would be cases with very flat local minima/maxima where you can get trapped in that flat spot and need a kick out.
If you want a fitting metaphor for flat spots, I had an old Toyota Corolla with an old starter. I’d be able to start the car up successfully several times in a row, but then I’d hit a dead spot in the flywheel and it just wouldn’t start. Eventually, my dad bought me the Official Starter Repair Kit: a crowbar. His advice was to apply sufficient percussive force until I nudged the starter out of its dead spot, and then the car could start successfully. Mutation provides benefits in much the same way. And just like my beater bar, mutation is not a technique you want to rely upon constantly. At the extreme, mutation is just random search, losing all of the important information that a genetic algorithm learns during the process.
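The per-gene mutation step described above is short enough to sketch in a single Python function. The function name and the default rate of 1% are assumptions, and the sketch only covers the two-allele (boolean) case.

```python
import random

def mutate(chromosome, rng, mutation_rate=0.01):
    """Return a copy of the chromosome in which each gene is flipped
    independently with probability mutation_rate."""
    return [(not gene) if rng.random() < mutation_rate else gene
            for gene in chromosome]

original = [True, False, True, True, False, False, True, False]
untouched = mutate(original, random.Random(7), mutation_rate=0.0)
all_flipped = mutate(original, random.Random(7), mutation_rate=1.0)
```

The two extreme rates show the trade-off: at 0 nothing ever changes, and as the rate climbs the children look less and less like their parents, which is the slide toward random search.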
Finishing The Process
At this point, we have a finished organism.
We do this for each slot in the population and then repeat the process: score, choose, cross over (or not), mutate. We have a few potential end conditions. Some of them are:
- Stop after a fixed number of generations
- Stop after we reach a known ideal solution (e.g., 0 distance from the actual values)
- Stop after we get close enough to a known ideal solution
- Stop after we have stasis for a certain number of generations, where the fitness score of the best organism has not improved in a while
- Stop after a certain amount of time
There are other stop conditions as well, but these are the most common.
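To tie the whole loop together, here is a compact Python sketch of score, choose, cross over (or not), and mutate, with three of the stop conditions above. Everything in it is an assumption for illustration: the function name, the default parameters, and the toy fitness function (counting True genes). It also assumes non-negative scores where higher is better, and it uses the standard library's `random.choices` as the roulette wheel.

```python
import random

def evolve(fitness, num_genes, ideal_score, rng, pop_size=100,
           max_generations=200, stasis_limit=25,
           crossover_rate=0.7, mutation_rate=0.01):
    """Run a simple genetic algorithm; return (best chromosome, best score, stop reason)."""
    population = [[rng.random() < 0.5 for _ in range(num_genes)]
                  for _ in range(pop_size)]
    best, best_score, stale = None, float("-inf"), 0
    for _ in range(max_generations):       # stop: fixed number of generations
        scores = [fitness(c) for c in population]
        top = max(range(pop_size), key=scores.__getitem__)
        if scores[top] > best_score:
            best, best_score, stale = population[top][:], scores[top], 0
        else:
            stale += 1
        if best_score >= ideal_score:      # stop: reached a known ideal solution
            return best, best_score, "ideal"
        if stale >= stasis_limit:          # stop: no improvement for a while
            return best, best_score, "stasis"
        # Roulette wheel; fall back to uniform picks if every score is zero.
        weights = scores if sum(scores) > 0 else None
        next_population = []
        while len(next_population) < pop_size:
            a, b = (p[:] for p in rng.choices(population, weights=weights, k=2))
            if rng.random() < crossover_rate:
                start, stop = sorted(rng.randrange(num_genes) for _ in range(2))
                a[start:stop + 1], b[start:stop + 1] = b[start:stop + 1], a[start:stop + 1]
            for child in (a, b):
                next_population.append(
                    [(not g) if rng.random() < mutation_rate else g for g in child])
        population = next_population[:pop_size]
    return best, best_score, "max_generations"

# Toy problem (assumed): the ideal chromosome has every gene set to True.
best, score, reason = evolve(sum, num_genes=8, ideal_score=8, rng=random.Random(1))
```

A time-based stop would follow the same pattern: record a start time before the loop and return once the elapsed time passes your limit.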
Today we covered the basics of genetic algorithms and how the process works. Tomorrow, we’ll look at using a genetic algorithm library to solve different types of problems.