Wrapping up my series on Kafka, I want to spend a few moments reflecting on how my Kafka application has performed. When I started this journey, I wanted to put together an easy-to-understand application that performs well. I think I’ve solved the first problem well enough, but the second remains. What follows are some assorted thoughts on this performance gap.
Keep in mind throughout this that my simple CSV pump-and-dump averaged 18K records per second, my enricher averaged somewhere around 2K records per second, and my data aggregator averaged approximately 10K records per second. Considering that the article which drove Kafka to prominence showed 2 million writes per second, I’m way undershooting the mark. Without running any rigorous benchmarks, here are a few things that might increase throughput:
- I’ve run the test with a single producer pushing to a single broker, all from one laptop. That gives me several avenues for parallelism: I could have more producers pushing to more brokers, each on dedicated hardware.
- I’m pushing to just one partition. If I increase the number of brokers, I should increase the number of partitions as well, so that writes are spread across the brokers in parallel.
- I don’t believe that I’m batching messages to Kafka. One way to increase throughput is to bundle a number of small messages into a single send: each TCP round trip carries a fixed overhead, so packing more data into each send amortizes that fixed cost across more records. Kafka also incurs per-message costs in handling and writing individual records, which batching reduces as well.
- I’m using a .NET library. The .NET libraries for Kafka seem to be a lot slower than their Java counterparts. In discussions with another person who pulled off a similar project, he saw about a 5x improvement switching from a .NET library to a Java library (100K records/sec to 500K). I’m sure that’s a ‘Your Mileage May Vary’ scenario, but it makes sense to me, given that the entire Hadoop ecosystem is Java-first (and sometimes Java-only).
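To make the partitioning point concrete, here is a simplified sketch of key-based partition assignment: hash the message key, mod the partition count. The exact hash varies by client (librdkafka’s default keyed partitioner uses CRC32, the Java client uses murmur2), so treat this as an illustration of the idea, not any particular client’s implementation.

```python
import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    """Simplified key-based partitioner: hash the key, mod the partition count.
    Real clients use specific hashes (CRC32 in librdkafka, murmur2 in Java)."""
    return zlib.crc32(key) % num_partitions

# With one partition, every message lands in the same place,
# so a single broker handles the entire write load:
single = pick_partition(b"AA1234", 1)   # always 0

# With more partitions, distinct keys spread across them,
# letting multiple brokers absorb writes in parallel:
spread = pick_partition(b"AA1234", 6)   # some value in 0..5
```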
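The batching argument can also be sketched out. The config keys below are real librdkafka-style settings (the same names the Confluent client libraries expose), but the values are illustrative guesses rather than tuned numbers, and the throughput function is back-of-the-envelope arithmetic, not a measurement from my setup:

```python
# Illustrative producer settings (librdkafka-style keys).
# Values are assumptions for the sake of the example.
producer_config = {
    "bootstrap.servers": "localhost:9092",
    "linger.ms": 50,              # wait up to 50 ms to fill a batch before sending
    "batch.num.messages": 10000,  # cap on messages per batch
    "compression.type": "snappy", # compress whole batches, not individual messages
}

def effective_throughput(overhead_ms: float, per_msg_ms: float, batch_size: int) -> float:
    """Messages per second when a fixed per-send overhead is amortized
    over batch_size messages."""
    total_ms = overhead_ms + per_msg_ms * batch_size
    return batch_size / (total_ms / 1000.0)

# With a hypothetical 1 ms fixed cost per send and 0.01 ms of per-message work:
unbatched = effective_throughput(1.0, 0.01, 1)     # ~990 msgs/sec
batched = effective_throughput(1.0, 0.01, 1000)    # ~90,909 msgs/sec
```

Even with made-up numbers, the shape of the result holds: once the fixed per-send cost dominates, batching improves throughput by well over an order of magnitude.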
After monkeying with some configuration settings, I was able to get about 25,000 messages per second into the Flights topic, and could bump up the others by a fair percentage as well. When generating messages and shooting them directly into Kafka, I was able to get 30K messages per second. That’s nowhere near the 800K messages per second Jay Kreps got in his single-producer-thread, no-replication scenario, but it’s nothing to sneeze at.