In the last exciting episode, we loaded some Mars farming data into our Data Explorer pool. Today, we’re going to query that data.
Navigating to https://dataexplorer.azure.com, we can see the Synapse workspace, Data Explorer pool, and table in the
In order to query data, we need to use the Kusto Query Language, KQL. If you’re familiar with Splunk’s language, KQL is pretty similar. It’s just enough like SQL that it feels like you should understand it but not SQL-like enough that you’ll actually have an intuitive understanding of the language.
One quick note is that all KQL statements are case-sensitive. I personally consider this a mistake in a query language, but they didn’t ask me, I suppose. With that said, let’s get digging.
First up, I want to see how many records I have in the
tibble table. I can do that with the following command:
tibble | count
Note that we use the pipe operator to separate statement components, sort of like a pipeline.
So yeah, I loaded 2.2 billion rows in there, no big deal. I generated one sample every 10 minutes over the course of 5 years for 9000 plots of data. In fairness, this was for a training on big data analytics and I couldn’t do that on a couple dozen rows.
Anyhow, now that we know that we do have data, let’s take a quick look at it. We’ll get the top 10 rows.
tibble | take 10
We can see the structure of this data. I should also note that we haven’t done any type of sorting of the data, so expect it in an arbitrary ordering.
Now let’s do something a little more complicated and see how long it takes us to get all scores for plot 1406 on the day of 2119-06-04.
tibble | where PlotID == 1406 | where SampleDateTime between (datetime(2119-06-04)..datetime(2119-06-05))
Ah, but wait! The bottom row shows 2119-06-05 at midnight, which is one more than what we want. Similarly to SQL,
between in KQL is inclusive on both sides. The solution is pretty similar as well:
tibble | where PlotID == 1406 | where SampleDateTime >= datetime(2119-06-04) | where SampleDateTime < datetime(2119-06-05)
If we have two conditions, we can handle exclusive date ranges.
We can see that the arability score doesn’t change drastically over the course of the day. What if we averaged out arability scores by day and bucketed over a timeframe? Let’s also introduce a couple of variables to see how we can use them and also do a bit of time math.
In order to make things a little less complicated, we’ll take it a step at a time.
let _startDate = datetime(2119-03-01); let _endDate = datetime(2119-04-01); tibble | where PlotID == 1406 | extend _bin = bin_at(SampleDateTime, 1d, _startDate) | extend _range = range(_bin, _endDate, 1d) | where SampleDateTime >= _startDate | where SampleDateTime < _endDate
We can see that the
let keyword defines variables and I create two here:
_endDate. I use those in the
where clauses at the bottom of the query to constraint our range. I also use the
extend operator to create new columns, one called
_bin and the other
_bin column uses
bin_at() to group each data point by timeframe, in this case, one-day intervals (
_range column gives us a JSON array of the entire range of our dataset. We don’t actually need
_range here but it does give us an idea of one useful function. So let’s get rid of
_range and then get some summary statistics.
let _startDate = datetime(2119-03-01); let _endDate = datetime(2119-04-01); tibble | where PlotID == 1406 | extend _bin = bin_at(SampleDateTime, 1d, _startDate) | where SampleDateTime >= _startDate | where SampleDateTime < _endDate | summarize min(ArabilityScore), max(ArabilityScore), avg(ArabilityScore) by _bin
Okay, so it turns out that arability score doesn’t actually change all that often. But that’s okay: we can see that the process works.
One quick note is that the
summarize command needs to be after the
where clauses because
summarize makes columns like
SampleDateTime unavailable to us. In other words, here we want to filter and then group, not the other way around.
In today’s post, we got to see a few bits of the Kusto Query Language in action, including table selection, basic filtering, basic aggregations, and a little bit of binning. In the next post, we will dive a bit deeper into the language and see what else we could unearth (un-mars?).