I had the privilege to sit down with a couple of Hortonworks solution engineers and discuss a potential Hadoop solution for our environment. During that time, I learned an interesting strategy for handling data in Kafka.
Our environment uses MSMQ for queueing. We add items to a queue, and consumers pop items off of the queue and process them. The advantages are that you can easily see how many items are currently on the queue, and multiple consumer threads can share the queue without stepping on each other. Queue items last for a fixed amount of time; in our case, we typically expire them after one hour.
With Kafka, however, queues are handled a bit differently. The broker does not track which consumers have read which messages; each consumer is responsible for keeping its own position (its offset) in the queue. That means there is no single "items left on the queue" count the way there is in MSMQ, since every consumer can be at a different point, and consumers have no ability to pop items off of the queue. Instead, queue items fall off after they expire, whether or not anything has read them.
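To make the difference concrete, here is a minimal sketch of a retention-based queue in plain Python. It is a toy model, not the Kafka API: the `RetentionLog` class, its method names, and the 60-second retention window are all made up for illustration. It shows two consumers reading the same data at their own pace, and a slow consumer missing a record that expired before it got there.

```python
from dataclasses import dataclass, field


@dataclass
class RetentionLog:
    """Toy model of a retention-based log (hypothetical, not the Kafka API)."""
    retention_seconds: float
    base_offset: int = 0                          # offset of the oldest retained record
    records: list = field(default_factory=list)   # (timestamp, value) pairs

    def append(self, value, now):
        """Append a record and return its offset."""
        self.records.append((now, value))
        return self.base_offset + len(self.records) - 1

    def expire(self, now):
        """Drop records older than the retention window, read or not."""
        while self.records and now - self.records[0][0] > self.retention_seconds:
            self.records.pop(0)
            self.base_offset += 1

    def read(self, offset, max_records=10):
        """Return (next_offset, values) starting at offset; nothing is removed."""
        start = max(offset, self.base_offset) - self.base_offset
        chunk = self.records[start:start + max_records]
        return self.base_offset + start + len(chunk), [v for _, v in chunk]

    def end_offset(self):
        """Offset that the next appended record will receive."""
        return self.base_offset + len(self.records)


log = RetentionLog(retention_seconds=60)
for t, click in enumerate(["a", "b", "c"]):
    log.append(click, now=t)        # records written at t = 0, 1, 2

# Two independent consumers, each tracking only its own offset.
fast_offset, fast_seen = log.read(0)
slow_offset, slow_seen = log.read(0, max_records=1)

# "Items left" only makes sense per consumer, as a lag behind the end of the log.
fast_lag = log.end_offset() - fast_offset
slow_lag = log.end_offset() - slow_offset

# At t = 61.5 the records written at t = 0 and t = 1 are past retention.
log.expire(now=61.5)
slow_offset, slow_after_expiry = log.read(slow_offset)
```

After the expiry, the slow consumer resumes at record "c" and never sees "b". That is the failure mode to design around when consumers can fall behind.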
Our particular scenario has some web servers which need to handle incoming clicks. Ideally, we want to accept that click and dump it immediately onto a queue, freeing up the web server thread to handle the next request. Once data gets into the queue, we want it to live until our internal systems have a chance to process it. If we have a catastrophe and items fall off of this queue, we lose revenue.
The strategy in this case is to take advantage of multiple queues and multiple stages. I had thought of “a” queue, into which the web server puts click data and out of which the next step pulls clicks for processing. A better strategy (given what we do and our requirements) is to put the data into a queue immediately, have consumers pull from that queue and perform some internal “enrichment” processes, and finally put the enriched data onto a second queue. That second queue collects data, and a periodic batch job reads from it and writes to HDFS. This way, you don’t take the hit of streaming rows into HDFS one at a time.
As far as maintaining data goes, we’d need to set our TTL long enough that we can recover from an internal processing engine catastrophe, but not so long that we run out of disk space holding messages. That’s a delicate trade-off we’ll need to figure out as we go along.
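The staged design and the TTL trade-off can be sketched roughly as follows. This is only an illustration under stated assumptions: plain Python lists stand in for the two queues, and `enrich`, `run_enrichment`, `run_batch`, `retention_disk_bytes`, the `ad_id` and `campaign` fields, and every number are hypothetical. None of it comes from Kafka or from our actual system.

```python
import json


def enrich(click):
    """Hypothetical enrichment step: tag each click with a made-up field."""
    enriched = dict(click)
    enriched["campaign"] = "campaign-%d" % (click["ad_id"] % 2)
    return enriched


def run_enrichment(raw_log, enriched_log, offset):
    """Stage 2: read raw clicks from `offset` and append enriched copies.

    Returns the new offset. The raw log is only read, never popped.
    """
    for click in raw_log[offset:]:
        enriched_log.append(enrich(click))
    return len(raw_log)


def run_batch(enriched_log, offset, batch_size):
    """Stage 3: take up to batch_size records and render one batch payload.

    Returns (new_offset, payload). In the real design the payload would be
    written to HDFS as a single file; here it is just a string.
    """
    batch = enriched_log[offset:offset + batch_size]
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in batch)
    return offset + len(batch), payload


def retention_disk_bytes(msgs_per_sec, avg_msg_bytes, retention_hours, replicas=1):
    """Rough disk needed to hold one retention window of messages."""
    return msgs_per_sec * avg_msg_bytes * retention_hours * 3600 * replicas


raw_clicks = []        # stage 1 target: the web server only appends here
enriched_clicks = []   # stage 2 target: enriched records awaiting the batch job

for ad_id in (1, 2, 3):
    raw_clicks.append({"ad_id": ad_id})

enrich_offset = run_enrichment(raw_clicks, enriched_clicks, 0)
batch_offset, payload = run_batch(enriched_clicks, 0, batch_size=2)

# Illustrative numbers only: 500 msgs/s, 1 KB each, 24 h retention, 3 copies.
disk_needed = retention_disk_bytes(500, 1000, 24, replicas=3)
```

With those made-up numbers, a 24-hour window works out to about 130 GB of disk. Doubling the retention period doubles the disk requirement, which is the trade-off in a nutshell.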
Summing things up, Kafka is quite a different product from MSMQ, and a workable architecture is going to look different depending upon which queue product you use.