Here’s a problem I ran into as I worked through my Kafka code. Near the end of the consumer project, I used the Seq.zip3 function to zip together three separate tuples and turn them into a quadruple. This is a great function to flatten out a set of tuples, but I kept getting results that looked like this:
I kept running into situations like the above. With Iowa, for example, the process said that we had 1994 flights come in over the time frame, but there were 3719 flights delayed. This…didn’t seem right. Similarly, there were 30,772 flights to Vermont, of which only 871 were delayed? I have trouble believing that there were more flights to Vermont than Connecticut or Kentucky (which hosts the Cincinnati airport).
To help debug the code, I checked each individual tuple and spit out the counts, and noticed that although the delay counts were fine, the totals were way off. After digging into the zip function a bit more, it clicked. Zip does not sort first!
When you call the zip method, you’re not “joining” in the relational sense; you’re simply taking the first element from the first set and attaching to it the first element from the second set; then you take the second element from each set, the third, etc.
Once I understood that and corrected the code, life was good: