Today, we’re going to look at the specific problem I want to solve using Kafka.

The Data

I am going to use flight data because it is copious and free.  Specifically, I’m going to look at data from 2008, giving me approximately 7 million data points.

The Question

Given this data, I’d like to ask a few questions.  Specifically, by destination state, how many flights did we have in 2008?  How many of those flights were delayed?  How long were those delays in the aggregate?  And given that we’re going to experience a delay, how long can we expect it to be?
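To make those four questions concrete, here is a minimal Python sketch of the aggregation, not the solution built later in this series.  It assumes each flight has already been reduced to a (destination state, arrival delay in minutes) pair and that "delayed" means an arrival delay greater than zero.  The function name, the field names, and the sample values are all made up for illustration.

```python
from collections import defaultdict


def summarize_by_state(flights):
    """Aggregate (dest_state, arr_delay_minutes) pairs per destination state."""
    totals = defaultdict(lambda: {"flights": 0, "delayed": 0, "delay_minutes": 0})
    for state, delay in flights:
        row = totals[state]
        row["flights"] += 1
        # Assumption: a flight counts as delayed only if it arrived late.
        if delay is not None and delay > 0:
            row["delayed"] += 1
            row["delay_minutes"] += delay

    summary = {}
    for state, row in totals.items():
        # Expected delay, given that the flight was delayed at all.
        expected = row["delay_minutes"] / row["delayed"] if row["delayed"] else None
        summary[state] = {**row, "expected_delay": expected}
    return summary


# Made-up sample values, not real 2008 figures.
sample = [("OH", 12), ("OH", -3), ("OH", 30), ("NC", 0), ("NC", 45)]
summary = summarize_by_state(sample)
```

The last question is a conditional average:  total delay minutes divided by the number of delayed flights, not by the number of all flights.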

The Other Question:  Why Use Kafka?

Hey, I know how to solve this problem with SQL Server:  read the 2008 CSV file in and write a basic SQL query.  Or I could load the data into Hive and run the same query.

So why would I want to use Kafka?  Isn’t this overkill?  In this case it is.  But let’s spin the story a slightly different way:  instead of reading from a file, let’s pretend that our consumers are receiving messages from a web service or some other application.  We now want to stream the data in, possibly from a number of sources.  We can decouple message collection from message processing.

In the real world, my problem involves handling clickstream data, where we’re looking at hundreds of millions of messages per day.  So what I’m dealing with is certainly a toy problem, but it’s the core of a real problem.

The Infrastructure

So here’s my plan:  I’m going to have a console application read from my flight data file, putting messages into a single Kafka topic with a single partition.  The purpose of this app is to do nothing more than read text strings and push them onto the topic, doing nothing fancy.
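The shape of that first application can be sketched in a few lines of Python.  This is only an illustration under assumptions:  the topic name "flights-raw" and the function name are hypothetical, the sample lines are invented, and the send argument is a stand-in for whatever a real Kafka client library's producer exposes.  A plain list plays the part of the topic here so the sketch runs without a broker.

```python
def publish_lines(lines, send, topic="flights-raw"):
    """Push each non-empty line onto `topic` unchanged; return how many were sent."""
    sent = 0
    for line in lines:
        text = line.rstrip("\r\n")
        if not text:
            continue  # skip blank lines; nothing else is inspected or altered
        # Kafka message values are bytes, so encode the text before sending.
        send(topic, text.encode("utf-8"))
        sent += 1
    return sent


# Stand-in for a real producer: collect (topic, value) pairs in a list.
outbox = []
made_up_lines = ["2008,1,3,IAD,TPA\n", "\n", "2008,1,3,IND,BWI\n"]
count = publish_lines(made_up_lines, lambda topic, value: outbox.append((topic, value)))
```

Passing the send function in keeps the reader free of any parsing or business logic, which matches the "nothing fancy" goal:  the only job is to move text from the file to the topic.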

From there, I intend to enrich the data using another application.  This will process messages on the raw topic, clean up messages (e.g., fixing known bad data), and perform a lookup against some data in SQL Server.  Once we’re done with that, we can put the results on an enriched data topic.  Our final process will read from the enriched topic and build up a data set that we can use for our solution.
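Here is a rough Python sketch of what one enrichment step might look like.  Everything in it is an assumption made for illustration:  the two-column "dest,arr_delay" message layout, the treatment of a non-numeric delay as missing, the JSON output format, and the small dictionary standing in for the SQL Server lookup.  Plain lists stand in for the raw and enriched topics.

```python
import json

# Stand-in for the SQL Server lookup (airport code -> state).
AIRPORT_STATES = {"TPA": "FL", "BWI": "MD"}


def enrich(raw_line, airport_states=AIRPORT_STATES):
    """Turn a 'dest,arr_delay' string into enriched JSON, or None if unusable."""
    parts = [p.strip() for p in raw_line.split(",")]
    if len(parts) != 2:
        return None  # malformed message: drop it
    dest, delay_text = parts
    dest = dest.upper()  # clean-up step: normalize the airport code
    state = airport_states.get(dest)
    if state is None:
        return None  # unknown airport: nothing to enrich with
    try:
        delay = int(delay_text)
    except ValueError:
        delay = None  # assumption: a non-numeric delay means "missing"
    return json.dumps({"dest": dest, "dest_state": state, "arr_delay": delay})


# Made-up messages on the "raw topic".
raw_topic = ["TPA,12", "bwi, NA", "garbage"]
enriched_topic = [m for m in map(enrich, raw_topic) if m is not None]
```

Because this stage only reads from one topic and writes to another, it can be changed or restarted without touching the application that collects the raw messages.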

What’s Next?

The next four posts are going to cover some specifics.  In the next post, I’m going to walk through some of the really simple parts of Kafka, including setting up topics and pushing & receiving messages.  From there, the next three posts will look at the three console applications I’m going to write to handle the different stages of this application.  I’ll wrap it up with a couple more Kafka-related posts.  So stay tuned!
