In the last exciting episode, we loaded some Mars farming data into our Data Explorer pool. Today, we’re going to query that data.

Navigating to https://dataexplorer.azure.com, we can see the Synapse workspace, Data Explorer pool, and table in the Query tab.

Ready to query!

In order to query data, we need to use the Kusto Query Language, KQL. If you’re familiar with Splunk’s language, KQL is pretty similar. It’s just enough like SQL that it feels like you should understand it but not SQL-like enough that you’ll actually have an intuitive understanding of the language.

One quick note is that all KQL statements are case-sensitive. I personally consider this a mistake in a query language, but they didn’t ask me, I suppose. With that said, let’s get digging.
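A quick illustration of what that means in practice, using the tibble table we're about to query (the error notes are the gist, not verbatim messages):

tibble | count     // works
Tibble | count     // error: table names are case-sensitive
tibble | Count     // error: operator names are lowercase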

First up, I want to see how many records I have in the tibble table. I can do that with the following command:

tibble
| count

Note that we use the pipe operator (|) to chain operators together, much like a shell pipeline.
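As an aside, the count operator is shorthand for an aggregation. An equivalent query using summarize, with an explicit column name (RowCount is just a label I picked), looks like this:

tibble
| summarize RowCount = count()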

2 billion rows? That’s a pretty good amount of data!

So yeah, I loaded 2.2 billion rows in there, no big deal. I generated one sample every 10 minutes over the course of 5 years for 9,000 plots. In fairness, this was for a training on big data analytics, and I couldn't exactly do that with a couple dozen rows.

Anyhow, now that we know we have data, let's take a quick look at it. We'll grab 10 rows.

tibble
| take 10

Definitely 10 rows.

We can see the structure of this data. I should also note that we haven't sorted the data in any way, so expect the rows in an arbitrary order.
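If you do want a deterministic result, sort before taking rows. A minimal sketch, ordering by the SampleDateTime column:

tibble
| sort by SampleDateTime desc
| take 10

The top operator collapses this into a single step: top 10 by SampleDateTime desc.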

Now let’s do something a little more complicated and see how long it takes to get all scores for plot 1406 on 2119-06-04.

tibble
| where PlotID == 1406
| where SampleDateTime between (datetime(2119-06-04)..datetime(2119-06-05))

Filter with bonus shark tooth effect.

Ah, but wait! The bottom row shows 2119-06-05 at midnight, which is one row more than we want. As in SQL, between in KQL is inclusive on both ends. The solution is pretty similar as well:

tibble
| where PlotID == 1406
| where SampleDateTime >= datetime(2119-06-04)
| where SampleDateTime < datetime(2119-06-05)

With two separate conditions, we get a half-open range: inclusive of the start date, exclusive of the end date.

We can see that the arability score doesn’t change drastically over the course of the day. What if we averaged arability scores by day over a longer timeframe? Let’s also introduce a couple of variables to see how we can use them, and do a bit of time math while we’re at it.
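Speaking of time math: subtracting two datetimes yields a timespan, and adding a timespan literal like 1d shifts a datetime. A quick standalone illustration (the dates are arbitrary, and DayLength and NextDay are just labels I picked):

print DayLength = datetime(2119-06-05) - datetime(2119-06-04), NextDay = datetime(2119-06-04) + 1d

DayLength comes back as 1.00:00:00, a one-day timespan, and NextDay as 2119-06-05 at midnight.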

In order to make things a little less complicated, we’ll take it a step at a time.

let _startDate = datetime(2119-03-01);
let _endDate = datetime(2119-04-01);
tibble
| where PlotID == 1406
| extend _bin = bin_at(SampleDateTime, 1d, _startDate)
| extend _range = range(_bin, _endDate, 1d)
| where SampleDateTime >= _startDate
| where SampleDateTime < _endDate

We can see that the let keyword defines variables, and I create two here: _startDate and _endDate. I use those in the where clauses at the bottom of the query to constrain our range. I also use the extend operator to create new columns, one called _bin and the other _range.
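One nice touch: a let variable can be defined in terms of an earlier one, which pairs well with time math. A small sketch (the names mirror the query above):

let _startDate = datetime(2119-03-01);
let _endDate = _startDate + 31d;
print _startDate, _endDate

Since March has 31 days, _endDate lands on 2119-04-01, the same end date we typed out by hand.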

Bin and Range was a flop at the box office, as audiences weren’t yet ready for a buddy cop comedy.

The _bin column uses bin_at() to group each data point into a timeframe, in this case one-day intervals (1d). The _range column gives us a dynamic (JSON) array covering the entire range of our dataset. We don’t actually need _range here, but it does show off a useful function.
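To see what bin_at() actually does in isolation, here’s a minimal check with an illustrative timestamp:

print bin_at(datetime(2119-03-03 14:25:00), 1d, datetime(2119-03-01))
// returns 2119-03-03 00:00:00, the start of the one-day bin containing the sample

With that in hand, let’s get rid of _range and then get some summary statistics.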

let _startDate = datetime(2119-03-01);
let _endDate = datetime(2119-04-01);
tibble
| where PlotID == 1406
| extend _bin = bin_at(SampleDateTime, 1d, _startDate)
| where SampleDateTime >= _startDate
| where SampleDateTime < _endDate
| summarize min(ArabilityScore), max(ArabilityScore), avg(ArabilityScore) by _bin

Behold the variance of my arability!

Okay, so it turns out that arability score doesn’t actually change all that often. But that’s okay: we can see that the process works.

One quick note: the summarize operator needs to come after the where clauses, because summarize only outputs its grouping and aggregate columns, leaving columns like SampleDateTime unavailable afterward. In other words, here we want to filter and then group, not the other way around.
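As a final polish, you can name the aggregate columns rather than accepting defaults like avg_ArabilityScore, and if you don’t need bins anchored to a particular start point, the simpler bin() works in place of bin_at(). Here’s a sketch of the same query with those tweaks:

tibble
| where PlotID == 1406
| where SampleDateTime >= datetime(2119-03-01)
| where SampleDateTime < datetime(2119-04-01)
| summarize MinScore = min(ArabilityScore), MaxScore = max(ArabilityScore), AvgScore = avg(ArabilityScore) by Day = bin(SampleDateTime, 1d)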

Conclusion

In today’s post, we got to see a few bits of the Kusto Query Language in action, including table selection, basic filtering, basic aggregations, and a little bit of binning. In the next post, we will dive a bit deeper into the language and see what else we could unearth (un-mars?).
