Today’s post will look at the first of three console applications. This app is going to read in flight data from an external source (in my case, a flat file) and push messages out to a Kafka topic.
A Sidebar: F#
If you look on the left-hand side, you’ll see the three console projects that we are going to use for this Kafka solution. You may notice that these are all F# projects.
Whenever I present my Integrating Hadoop and SQL Server talk, I always get asked one question: “Why are you using F#?” I like getting this question. The honest answer is, because I enjoy the language. But there’s a bit more to it than that.
First, let me respond to the important subtext of the question: no, there’s nothing in this project that we absolutely need F# for. I can just as easily write C# code to get to the same answer, and if you’re more comfortable writing code in C#, then go for it.
So why, then, do I persist? Here are my reasons, in no particular order:
- I want to learn things, and the best way to learn a language is to use it. I’m not particularly good at writing F# code; I’ve been exposed to it for a couple of years now and can code my way out of a wet paper bag, but it’d be scary to jump onto a project as an F# developer. I want to change that.
- The reason I want to change that is that I see the power of functional languages. If you’re not sold on the importance of functional languages, I need only note that both C# and Java have made significant functional inroads with LINQ/lambdas, closures, pipelines (via the fluent interface in C#), etc.
- I want to show that the entryway from C# to F# is pretty easy. A very complex block of F# code will leave me scratching my head, absolutely, but I can give working, tutorial-style code to people and have them easily understand it. If I’m going to show you easy code, I’d rather show you easy code in a language you don’t know—that way, you’re learning two things at the same time. This does run the risk of unnecessarily confusing people, admittedly, but I think the risk is relatively low gien my target audience.
- I think there’s a synergy between functional languages and SQL. I believe that having a background in SQL gives you an appreciation for set-based solutions and treating data and functions as first-class citizens. A language like F# or Scala appeals to the relational database developer in me, as I’m using the pipeline to push data through functions rather than nesting for loops.
- The code is more concise. My producer is barely 40 lines of code, including a big comment block. This is probably a trivial enough application that I could write a similar C# app in not too many more lines of code, but one scales with respect to complexity much better than the other.
Those are my major reasons. There are reasons not to use F#, but my belief is that the reasons to use this language in my demos outweigh the negatives.
Anyhow, with all of that said, let’s check out what I’m doing.
For the entire solution, I’m going to need a few NuGet packages. Here are my final selections:
The first two are F#-specific packages. We’ll see ParallelSeq in action today, but the SqlClient will have to wait until tomorrow, as will Newtonsoft.Json.
Other Kafka libraries for .NET that I looked at include:
- Kafka Sharp, which is intended to be a fast implementation.
- Franz, a Kafka library written in F# (but works just as well with C#).
- Kafka4net, another implementation designed around Kafka 0.8.
- CSharpClient-for-Kafka, Microsoft’s implementation.
You’re welcome to try out the different libraries and see what works best for you. There was even an article written around writing your own Kafka library.
The console app is pretty simple. Simple enough that I’m going to include the functional chunk in one code snippet:
module AirplaneProducer open RdKafka open System open System.IO open FSharp.Collections.ParallelSeq let publish (topic:Topic) (text:string) = let data = System.Text.Encoding.UTF8.GetBytes(text) topic.Produce(data) |> ignore let loadEntries (topic:Topic) fileName = File.ReadLines(fileName) |> PSeq.filter(fun e -> e.Length > 0) |> PSeq.filter(fun e -> not (e.StartsWith("Year"))) //Filter out the header row |> PSeq.iter(fun e -> publish topic e) [<EntryPoint>] let main argv = use producer = new Producer("sandbox.hortonworks.com:6667") let metadata = producer.Metadata(allTopics=true) use topic = producer.Topic("Flights") //Read and publish a list from a file loadEntries topic "C:/Temp/AirportData/2008.csv" printfn "Hey, we've finished loading all of the data!" Console.ReadLine() |> ignore 0 // return an integer exit code
That’s the core of our code. The main function instantiates a new Kafka producer and gloms onto the Flights topic. From there, we call the loadEntries function. The loadEntries function takes a topic and filename. It streams entries from the 2008.csv file and uses the ParallelSeq library to operate in parallel on data streaming in (one of the nice advantages of using functional code: writing thread-safe code is easy!). We filter out any records whose length is zero—there might be newlines somewhere in the file, and those aren’t helpful. We also want to throw away the header row (if it exists) and I know that that starts with “Year” whereas all other records simply include the numeric year value. Finally, once we throw away garbage rows, we want to call the publish function for each entry in the list. The publish function encodes our text as a UTF-8 bytestream and pushes the results onto our Kafka topic.
Running from my laptop, loading 7 million or so records took 6 minutes and 9 seconds.
That averages out to about 19k records per second. That’s slow, but I’ll talk about performance later on in the series. For real speed tests, I’d want to use independent hardware and spend the time getting things right. For this demo, I can live with these results (though I’d happily take an order of magnitude gain in performance…). In testing the other two portions, this is the fastest of the three, which makes sense: I’m doing almost nothing to the data outside of reading and publishing.
We’ve built a console app which pushes data to a Kafka topic. That’s a pretty good start, but we’re only a third of the way done. In tomorrow’s post, I’m going to pull data off of the Flights topic and try to make some sense of it. Stay tuned!