This week’s post has to do with handling dates in files. One of the data sets I use in demos for PolyBase Revealed is an open data set of Raleigh, North Carolina police incidents. This data set is nice because it’s enough rows not to be a trivial example, but still small enough that people can download and use it without too much issue. It also has relatively few data issues and a little bit of column variety.
A Non-Problem Problem
If you download the latest version of that data set as a spreadsheet, you get a CSV file. Open up that CSV file in the editor of your choice and you’ll see dates in a particular format:
This is an ISO 8601 date format for date and time in UTC. It is a good format because it tells you the date, the time (down to milliseconds), and that the date is in UTC format (time zone Z). The T in the middle ensures that there are no spaces in the time, so parsers looking to split on space don’t get confused. All in all, this is a good format. But PolyBase doesn’t support it. That’s a failing on PolyBase’s part rather than the process which generated this data because this is an international standard format for dates and times, not something goofy that someone came up with one day.
PolyBase has a very limited number of supported date formats. This ISO standard is not one of the supported date formats, so if I were to import the data and set these two columns to
DATETIME, or any other date format, PolyBase would choke. That leaves us with a few options:
- Change the data as it gets pushed out from the source system. Sometimes you can control this, and if you can, that’s the easiest approach. In this case, because I’m getting data from a public data set, that’s not really an option.
- Load the data as
VARCHARrather than a date or time format. This option is okay if you don’t actually need the data as dates or if you don’t care about those particular columns.
- Convert the input file to ORC or Parquet format. I’ll have a blog post on that sometime in the future, as I think it’s a useful skill to know. But that’s not going to be an ideal solution across the board.
- Convert the data as a pre-processing step. Read it into some system, make relevant changes, and write it out in a way that PolyBase can understand.
I’m going to do a simple form of step 4 in this blog post.
Now You Have Two Problems
The way that we’re going to replace the dates in this file using a format that PolyBase can understand. Using the first date as an example, the file has
2014-11-08T15:47:00.000Z. Looking at the list of supported date formats above, PolyBase can handle
yyyy-MM-dd HH:mm:ss.fff format, so we can turn the date into
2014-11-08 15:47:00.000 and the PolyBase engine can understand this and convert it to an appropriate date format. Our fix here is to replace the T with a space and remove the Z from any date.
It turns out that there is a nice regular expression for this. I’m using Notepad++ to make this change, though other tools will also allow for regular expression-based find-and-replace operations. In Notepad++, hit
Ctrl+H to bring up the replace menu. Be sure that you have selected “Regular expression” for the search mode and enter the following into the find and replace boxes.
Your menu should look a bit like the one below:
What we’re doing is building out two capture groups in our regular expression: things which follow a particular shape and are wrapped in parentheses. In our case, the first capture group is the year-month-day combo in the format 4 digits, followed by a hyphen, followed by two digits, followed by a hyphen, followed by two more digits. As far as our example goes, that would be
The second capture group is the time portion: two digits, a colon, two more digits, another colon, two more digits, a decimal point, and three digits. In our example, that would be
The reason we care about these capture groups is that we can reference them as they are using backslashes. In the replace, I’m going to write out the first capture group, then a space, and then the second capture group. We get rid of the T and Z characters while retaining the actual values for date and time even when we do not know the particulars of that date and time. That’s important to remember here: I don’t want to try to figure out every combination of day followed by the letter T followed by hour, and I don’t want to risk false positives around strings which have a T or Z in them.
If I run Replace All with this regular expression, it makes two replacements for each line in the file. On my under-powered work machine, that took about 10 seconds to perform nearly 500K replacements. It’s not the fastest method for making this change and I wouldn’t try it on files with tens of millions of rows or scenarios where I’d have to do this every day or every week, but for a one-off job, it does alright.
If you don’t want to go through this hassle in the future, please upvote Greg Galloway’s proposal to improve date format support in PolyBase. And hope that this improvement (if it happens) makes it into the on-prem product.