Course Reviews: Writing Functions in R

This is a review of Hadley and Charlotte Wickham’s DataCamp course entitled Writing Functions in R.

Wickham and Wickham (borther and sister) do a great job of introducing some of the concepts of functional programming in R. The first chapter is a refresher on some of the basic program structures in R. From there, they get into building functions including rules around functions, how parameters are interpreted, and good tips for naming functions.

Chapter 3 is the money chapter for me: it introduces some of the key functional programming constructs in R, particularly in the purrr package. Higher-order functions and lambda expressions can be a bit complex to explain, but Charlotte does a fantastic job of it.

One area where I learned a lot was in chapters 4 and 5, dealing with writing safe functions. I haven’t been that good about writing robust R code, so learning about stopifnot, safely, and the like was enlightening.

This is a really good course. The course progresses well enough that you hardly notice how far along you’re getting—it’s a smooth learning curve and I don’t remember once having to stop while working on an exercise, look things up, and read up on a bunch of stuff not covered in the course. To me, that’s the sign of a successful DataCamp course.

Advertisements

PolyBase Revealed: Changing Dates With RegEx

This week’s post has to do with handling dates in files. One of the data sets I use in demos for PolyBase Revealed is an open data set of Raleigh, North Carolina police incidents. This data set is nice because it’s enough rows not to be a trivial example, but still small enough that people can download and use it without too much issue. It also has relatively few data issues and a little bit of column variety.

A Non-Problem Problem

If you download the latest version of that data set as a spreadsheet, you get a CSV file. Open up that CSV file in the editor of your choice and you’ll see dates in a particular format:

There was this one time I had two dates but I couldn’t let one either about the other.

This is an ISO 8601 date format for date and time in UTC. It is a good format because it tells you the date, the time (down to milliseconds), and that the date is in UTC format (time zone Z). The T in the middle ensures that there are no spaces in the time, so parsers looking to split on space don’t get confused. All in all, this is a good format. But PolyBase doesn’t support it. That’s a failing on PolyBase’s part rather than the process which generated this data because this is an international standard format for dates and times, not something goofy that someone came up with one day.

PolyBase has a very limited number of supported date formats. This ISO standard is not one of the supported date formats, so if I were to import the data and set these two columns to DATETIME2(3), DATETIME, or any other date format, PolyBase would choke. That leaves us with a few options:

  1. Change the data as it gets pushed out from the source system. Sometimes you can control this, and if you can, that’s the easiest approach. In this case, because I’m getting data from a public data set, that’s not really an option.
  2. Load the data as VARCHAR rather than a date or time format. This option is okay if you don’t actually need the data as dates or if you don’t care about those particular columns.
  3. Convert the input file to ORC or Parquet format. I’ll have a blog post on that sometime in the future, as I think it’s a useful skill to know. But that’s not going to be an ideal solution across the board.
  4. Convert the data as a pre-processing step. Read it into some system, make relevant changes, and write it out in a way that PolyBase can understand.

I’m going to do a simple form of step 4 in this blog post.

Now You Have Two Problems

The way that we’re going to replace the dates in this file using a format that PolyBase can understand. Using the first date as an example, the file has 2014-11-08T15:47:00.000Z. Looking at the list of supported date formats above, PolyBase can handle yyyy-MM-dd HH:mm:ss.fff format, so we can turn the date into 2014-11-08 15:47:00.000 and the PolyBase engine can understand this and convert it to an appropriate date format. Our fix here is to replace the T with a space and remove the Z from any date.

It turns out that there is a nice regular expression for this. I’m using Notepad++ to make this change, though other tools will also allow for regular expression-based find-and-replace operations. In Notepad++, hit Ctrl+H to bring up the replace menu. Be sure that you have selected “Regular expression” for the search mode and enter the following into the find and replace boxes.

Find: (\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2}.\d{3})Z

Replace: \1 \2

Your menu should look a bit like the one below:

Find and replace based on a regular expression. What could possibly go wrong?

What we’re doing is building out two capture groups in our regular expression: things which follow a particular shape and are wrapped in parentheses. In our case, the first capture group is the year-month-day combo in the format 4 digits, followed by a hyphen, followed by two digits, followed by a hyphen, followed by two more digits. As far as our example goes, that would be 2014-11-08.

The second capture group is the time portion: two digits, a colon, two more digits, another colon, two more digits, a decimal point, and three digits. In our example, that would be 15:47:00.000.

The reason we care about these capture groups is that we can reference them as they are using backslashes. In the replace, I’m going to write out the first capture group, then a space, and then the second capture group. We get rid of the T and Z characters while retaining the actual values for date and time even when we do not know the particulars of that date and time. That’s important to remember here: I don’t want to try to figure out every combination of day followed by the letter T followed by hour, and I don’t want to risk false positives around strings which have a T or Z in them.

If I run Replace All with this regular expression, it makes two replacements for each line in the file. On my under-powered work machine, that took about 10 seconds to perform nearly 500K replacements. It’s not the fastest method for making this change and I wouldn’t try it on files with tens of millions of rows or scenarios where I’d have to do this every day or every week, but for a one-off job, it does alright.

If you don’t want to go through this hassle in the future, please upvote Greg Galloway’s proposal to improve date format support in PolyBase. And hope that this improvement (if it happens) makes it into the on-prem product.

Southern Fried F# Call For Speakers Closes Soon

If you are interested in submitting to Southern Fried F#, the call for speakers closes this Friday, February 15th. Submit your proposal for an F#-related talk soon!

Even if you aren’t interested in speaking, tickets are available. This is a free conference which will take place on Saturday, April 13th in downtown Raleigh, North Carolina. Seating is limited, so be sure to sign up and reserve your spot.

The Value of Calendar Tables, Part 2: Simplify Queries

In the last post, I showed how to create a calendar table. In this post, we’ll look at a few places where calendar tables can simplify your queries significantly.

Finding Valid Dates

First up, let’s do a little bit of SQL Saturday Raleigh prep (notice how I slid that advertisement in so subtly?). We normally want to host our event sometime in April, so let’s look for valid dates over the next three years, as well as 2019 itself. The query itself is trivial:

SELECT
	c.Date
FROM dbo.Calendar c
WHERE
	c.MonthName = 'April'
	AND c.DayName = 'Saturday'
	AND c.CalendarYear BETWEEN 2019 AND 2022
ORDER BY
	c.Date;
That feeling when the guy in charge of the event says “No, I didn’t say pick one Saturday in April; I said pick every Saturday in April!”

Doing this in T-SQL without the assistance of a calendar table isn’t too awful, but it’s not as easy. With a tally table, you could build out enough rows and perform date calculations and filters…but it’s going to take you more than 9 lines of code.

Those Jealous Bunnies

There’s one problem with picking a SQL Saturday in April: Easter and Passover tend to run right around that time, and nobody wants a SQL Saturday on Passover or the day before Easter. Unfortunately, our calendar table doesn’t include holiday information. So let’s add it!

ALTER TABLE dbo.Calendar ADD HolidayName VARCHAR(40) NULL;

Now we have a column for holiday…but how do we get Easter and Passover info? Daniel Hutmacher has our back for Easter. Here’s my set-based version of his function:

CREATE TABLE #EasterBuilder
(
	Year INT,
	a TINYINT,
	b TINYINT,
	c TINYINT,
	d TINYINT,
	e TINYINT,
	f TINYINT,
	g TINYINT,
	h TINYINT,
	i TINYINT,
	k TINYINT,
	l TINYINT,
	M TINYINT,
	EasterDay DATE
);

INSERT INTO #EasterBuilder 
(
	Year,
	a,
	b,
	c
)
SELECT DISTINCT
	c.CalendarYear,
	c.CalendarYear % 19,
	FLOOR(1.0 * c.CalendarYear / 100),
	c.CalendarYear % 100
FROM dbo.Calendar c;

UPDATE #EasterBuilder
SET
	d = FLOOR(1.0 * b / 4),
	e = b % 4,
	f = FLOOR((8.0 + b) / 25);

UPDATE #EasterBuilder
SET
	g = FLOOR((1.0 + b - f) / 3);

UPDATE #EasterBuilder
SET
	h = (19 * a + b - d - g + 15) % 30,
	i = FLOOR(1.0 * c / 4),
	k = year % 4;

UPDATE #EasterBuilder
SET
	l = (32.0 + 2 * e + 2 * i - h - k) % 7;

UPDATE #EasterBuilder
SET
	m = FLOOR((1.0 * a + 11 * h + 22 * l) / 451);

UPDATE #EasterBuilder
SET
	EasterDay =
	DATEADD(dd, (h + l - 7 * m + 114) % 31,
	DATEADD(mm, FLOOR((1.0 * h + l - 7 * m + 114) / 31) - 1,
	DATEADD(yy, year - 2000, { D '2000-01-01' })
	)
	);

UPDATE c
SET
	HolidayName = 'Easter'
FROM dbo.Calendar c
	INNER JOIN #EasterBuilder eb
		ON c.Date = eb.EasterDay;

For Passover and a bunch of other holidays, I found a great set of functions by Karl Schmitt. This time, instead of converting functions to set-based operations, I’m going to use the functions as-is. By the way, if you want an easy script to insert into HolidayDef, here you go:

INSERT INTO dbo.HolidayDef 
(
	HolidayKey,
	OffsetKey,
	Type,
	FixedMonth,
	FixedDay,
	DayOfWeek,
	WeekOfMonth,
	Adjustment,
	HolidayName
)
VALUES
( 1, 0,'F', 1, 1,0,0,  0,'New Year''s Day'),
( 2, 0,'M', 1, 0,2,3,  0,'Martin Luther King Jr''s BD (Observed)'),
( 3, 0,'F', 2, 2,0,0,  0,'Ground Hog Day'),
( 4, 0,'F', 2,12,0,0,  0,'Lincoln''s Birthday'),
( 5, 0,'F', 2,14,0,0,  0,'Valentine''s Day'),
( 6, 0,'M', 2, 0,2,3,  0,'President''s Day'),
( 7,18,'O', 0, 0,0,0,-47,'Paczki Day (Mardi Gras)'),
( 8,18,'O', 0, 0,0,0,-46,'Ash Wednesday'),
( 9, 0,'F', 2,22,0,0,  0,'Washington''s Birthday'),
(10,15,'O', 0, 0,0,0,-30,'Purim'),
(11, 0,'F', 3,17,0,0,  0,'St. Patrick''s Day'),
(12, 0,'F', 3,19,0,0,  0,'St. Joseph''s Day'),
(13,18,'O', 0, 0,0,0,-14,'Passion Sunday'),
(14,18,'O', 0, 0,0,0, -7,'Palm Sunday'),
(15, 0,'S', 0, 0,0,0,  0,'Passover'),
(16,18,'O', 0, 0,0,0, -2,'Good Friday'),
(17, 0,'M', 4, 0,1,1,  0,'Daylight Savings Begins'),
(18, 0,'S', 0, 0,0,0,  0,'Easter Sunday'),
(19, 0,'S', 0, 0,0,0,  0,'Orthodox Easter'),
(20, 0,'M', 5, 0,7,1, -10,'Administrative Professionals Day'),
(21, 0,'F', 4,22,0,0,  0,'Earth Day'),
(22, 0,'M', 5, 0,1,2,  0,'Mother''s Day'),
(23, 0,'M', 5, 0,7,3,  0,'Armed Forces Day'),
(24, 0,'F', 5,31,0,0,  0,'Memorial Day'),
(25, 0,'F', 6,14,0,0,  0,'Flag Day'),
(26, 0,'M', 6, 0,1,3,  0,'Father''s Day'),
(27,18,'O', 0, 0,0,0, 49,'Pentecost'),
(28, 0,'F', 7, 4,0,0,  0,'Independence Day'),
(29, 0,'M', 9, 0,2,1,  0,'Labor Day'),
(30,15,'O', 0, 0,0,0,163,'Rosh Hashanah'),
(31, 0,'M', 9, 0,1,2,  0,'Grandparents Day'),
(32,15,'O', 0, 0,0,0,172,'Yom Kippur'),
(33,18,'O', 0, 0,0,0, 39,'Ascension Day'),
(34, 0,'F',10, 9,0,0,  0,'Leif Erikson Day'),
(35, 0,'M',10, 0,1,2,  0,'National Children''s Day'),
(36, 0,'M',10, 0,3,2,  0,'Columbus Day (Traditional)'),
(37, 0,'F',10,16,0,0,  0,'Boss''s Day'),
(38, 0,'M',10, 0,7,3,  0,'Sweetest Day'),
(39, 0,'M',11, 0,1,1, -7,'Daylight Savings Ends'),
(40, 0,'F',10,31,0,0,  0,'Halloween'),
(41, 0,'F',11, 1,0,0,  0,'All Saint''s Day'),
(42, 0,'M',11, 0,2,1,  1,'Election Day'),
(43, 0,'F',11,11,0,0,  0,'Veterans Day'),
(44, 0,'M',11, 0,5,4,  0,'Thanksgiving Day'),
(45, 0,'S', 0, 0,0,0,  0,'Chanukah'),
(46,18,'O', 0, 0,0,0, 56,'Trinity Sunday'),
(47, 0,'F',12,25,0,0,  0,'Christmas Day'),
(48,15,'O', 0, 0,0,0,177,'Sukkot'),
(49,15,'O', 0, 0,0,0,184,'Shemini Atzeret'),
(50,15,'O', 0, 0,0,0,185,'Simhat Torah (outside Isreal)'),
(51, 0,'F', 3,19,0,0,  0,'St. Josephs Day'),
(52,15,'O', 0, 0,0,0, 33,'Lag B''Omar'),
(53,15,'O', 0, 0,0,0, 50,'Shavuot'),
(54, 0,'S', 0, 0,0,0,  0,'Tu Bishvat'),
(55, 0,'S', 0, 0,0,0,  0,'Yom HaAtzma''ut'),
(56, 0,'S', 0, 0,0,0,  0,'Tisha B''Av');

I can then call dbo.GetHolidayDates for the time frame 1800-01-01 until 2525-12-31 and wait a little while because it’s a bunch of nested functions. Here’s how we get them into our holiday table:

CREATE TABLE #HolidayDays
(
	HolidayKey INT,
	HolidayDate DATE
);

INSERT INTO #HolidayDays 
(
	HolidayKey,
	HolidayDate
)
SELECT
	HolidayKey,
	HolidayDate
FROM dbo.GetHolidayDates(0, '1800-01-01', '2525-12-31');

UPDATE c
SET
	HolidayName = h.HolidayName
FROM dbo.Calendar c
	INNER JOIN #HolidayDays hd
		ON c.Date = hd.HolidayDate
	INNER JOIN dbo.HolidayDef h
		ON hd.HolidayKey = h.HolidayKey;

Now that I have this, I can make sure that we don’t schedule our SQL Saturday on the first day of Passover or on the day before Easter Sunday with the following:

USE Scratch
GO
WITH aprils AS
(
	SELECT
		c.Date,
		c.DayName,
		c.CalendarYear,
		c.HolidayName,
		LEAD(c.HolidayName, 1) OVER (ORDER BY c.Date) AS NextDayHoliday
	FROM dbo.Calendar c
	WHERE
		c.MonthName = 'April'
		AND c.CalendarYear BETWEEN 2019 AND 2022
)
SELECT
	a.Date,
	a.CalendarYear,
	a.NextDayHoliday
FROM aprils a
WHERE
	a.DayName = 'Saturday'
	AND ISNULL(a.HolidayName, '') <> 'Passover'
	AND ISNULL(a.NextDayHoliday, '') <> 'Easter Sunday'
ORDER BY
	a.Date;

This query got a little more complicated because of the way the LEAD window function works. LEAD runs after the WHERE clause, so I need to include at least Saturdays and Sundays. I decided to include all days in April just to make sure nothing wacky happens.

This gives me the following results:

Finding the right days. Like April 27th.

Fiscal Weeks of Year

Let’s say that I need to schedule a meeting for fiscal week 33 of the 2020 fiscal year. Finding this for calendar year isn’t too bad, but because companies have different fiscal years, you’ll have some trouble finding that online very easily. But we already took care of that in the past post, so this one’s easy again:

SELECT
	c.Date
FROM dbo.Calendar c
WHERE
	c.FiscalWeekOfYear = 33
	AND c.FiscalYear = 2020
ORDER BY
	c.Date ASC;
The next days I have open on my calendar.

Conclusion

In this post, we took a look at two ways that having a calendar table can make calculations significantly easier. As you have more complicated date queries, you’ll find that calendar tables can help out significantly and get you results quickly because those expensive calculations (like figuring out what day Easter Sunday lands on this year) only have to run once.

A Visit To China

As this post goes live, I will be in China. I have some blog posts scheduled so there should be some content coming up. Depending on whether I can get access while out there, I might throw in an update too.

In the meantime, you can constantly hit F5 while looking at my Instagram feed to see if I’ve uploaded any pictures.

The Value Of Calendar Tables, Part 1: Build A Table

Calendar tables (AKA date dimensions) are a fundamental component of any data warehouse, but these tables are also quite useful for OLTP systems. In this two-part series, we will look first at how to build a calendar table and then, in the second part of the series, we will see some reasons why you might want a calendar table even in your transactional processing systems.

Tip 1: Don’t Reinvent the Wheel

You can find oodles of posts out there about building calendar tables or date dimensions. One of my favorites comes from Aaron Bertrand, whose code I adopted and adapted for this post. You can easily find dozens of versions of these types of tables, so before you decide you need to build your own from first principles, gain some inspiration from what others have done before.

Tip 2: Don’t Sweat Load Time

Another thing to keep in mind here is that you’re only going to load your calendar table once, so if it takes two minutes to do, who really cares? The version I have should run reasonably fast–I calculated 726 years on slow hardware in 19 seconds and fast hardware in 11 seconds. I’m sure you can play code golf and get it done faster, but that’s probably not a good use of your time.

What you want to sweat instead is query time: how long is it taking to access this data?

The Script

First, let’s create a calendar table. Note that I added a primary key on the date column and a unique key on the DateKey column. In a warehouse, we would use DateKey everywhere; in a transactional system, I’d be more inclined to use the date itself.

CREATE TABLE dbo.Calendar
(
	DateKey INT NOT NULL,
	[Date] DATE NOT NULL,
	[Day] TINYINT NOT NULL,
	DayOfWeek TINYINT NOT NULL,
	DayName VARCHAR(10) NOT NULL,
	IsWeekend BIT NOT NULL,
	DayOfWeekInMonth TINYINT NOT NULL,
	CalendarDayOfYear SMALLINT NOT NULL,
	WeekOfMonth TINYINT NOT NULL,
	CalendarWeekOfYear TINYINT NOT NULL,
	CalendarMonth TINYINT NOT NULL,
	MonthName VARCHAR(10) NOT NULL,
	CalendarQuarter TINYINT NOT NULL,
	CalendarQuarterName CHAR(2) NOT NULL,
	CalendarYear INT NOT NULL,
	FirstDayOfMonth DATE NOT NULL,
	LastDayOfMonth DATE NOT NULL,
	FirstDayOfWeek DATE NOT NULL,
	LastDayOfWeek DATE NOT NULL,
	FirstDayOfQuarter DATE NOT NULL,
	LastDayOfQuarter DATE NOT NULL,
	CalendarFirstDayOfYear DATE NOT NULL,
	CalendarLastDayOfYear DATE NOT NULL,
	FirstDayOfNextMonth DATE NOT NULL,
	CalendarFirstDayOfNextYear DATE NOT NULL,
	FiscalDayOfYear SMALLINT NOT NULL,
	FiscalWeekOfYear TINYINT NOT NULL,
	FiscalMonth TINYINT NOT NULL,
	FiscalQuarter TINYINT NOT NULL,
	FiscalQuarterName CHAR(2) NOT NULL,
	FiscalYear INT NOT NULL,
	FiscalFirstDayOfYear DATE NOT NULL,
	FiscalLastDayOfYear DATE NOT NULL,
	FiscalFirstDayOfNextYear DATE NOT NULL,
	CONSTRAINT [PK_Calendar] PRIMARY KEY CLUSTERED([Date]),
	CONSTRAINT [UKC_Calendar] UNIQUE(DateKey)
);
GO

Now that we have the table, let’s load it up. First the script and then I’ll explain it in bits and pieces.

DECLARE
	@StartDate DATE = '18000101',
	@NumberOfYears INT = 726;

--Remove ambiguity with regional settings.
SET DATEFIRST 7;
SET DATEFORMAT mdy;
SET LANGUAGE US_ENGLISH;

DECLARE
	@EndDate DATE = DATEADD(YEAR, @NumberOfYears, @StartDate);

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)
INSERT INTO dbo.Calendar
(
	DateKey,
	[Date],
	[Day],
	[DayOfWeek],
	[DayName],
	IsWeekend,
	DayOfWeekInMonth,
	CalendarDayOfYear,
	WeekOfMonth,
	CalendarWeekOfYear,
	CalendarMonth,
	[MonthName],
	CalendarQuarter,
	CalendarQuarterName,
	CalendarYear,
	FirstDayOfMonth,
	LastDayOfMonth,
	FirstDayOfWeek,
	LastDayOfWeek,
	FirstDayOfQuarter,
	LastDayOfQuarter,
	CalendarFirstDayOfYear,
	CalendarLastDayOfYear,
	FirstDayOfNextMonth,
	CalendarFirstDayOfNextYear,
	FiscalDayOfYear,
	FiscalWeekOfYear,
	FiscalMonth,
	FiscalQuarter,
	FiscalQuarterName,
	FiscalYear,
	FiscalFirstDayOfYear,
	FiscalLastDayOfYear,
	FiscalFirstDayOfNextYear
)
SELECT
	CAST(D.DateKey AS INT) AS DateKey,
	D.[DATE] AS [Date],
	CAST(D.[day] AS TINYINT) AS [day],
	CAST(d.[dayofweek] AS TINYINT) AS [DayOfWeek],
	CAST(DATENAME(WEEKDAY, d.[Date]) AS VARCHAR(10)) AS [DayName],
	CAST(CASE WHEN [DayOfWeek] IN (1, 7) THEN 1 ELSE 0 END AS BIT) AS [IsWeekend],
	CAST(ROW_NUMBER() OVER (PARTITION BY d.FirstOfMonth, d.[DayOfWeek] ORDER BY d.[Date]) AS TINYINT) AS DayOfWeekInMonth,
	CAST(DATEPART(DAYOFYEAR, d.[Date]) AS SMALLINT) AS CalendarDayOfYear,
	CAST(DENSE_RANK() OVER (PARTITION BY d.[year], d.[month] ORDER BY d.[week]) AS TINYINT) AS WeekOfMonth,
	CAST(d.[week] AS TINYINT) AS CalendarWeekOfYear,
	CAST(d.[month] AS TINYINT) AS CalendarMonth,
	CAST(d.[monthname] AS VARCHAR(10)) AS [MonthName],
	CAST(d.[quarter] AS TINYINT) AS CalendarQuarter,
	CONCAT('Q', d.[quarter]) AS CalendarQuarterName,
	d.[year] AS CalendarYear,
	d.FirstOfMonth AS FirstDayOfMonth,
	MAX(d.[Date]) OVER (PARTITION BY d.[year], d.[month]) AS LastDayOfMonth,
	MIN(d.[Date]) OVER (PARTITION BY d.[year], d.[week]) AS FirstDayOfWeek,
	MAX(d.[Date]) OVER (PARTITION BY d.[year], d.[week]) AS LastDayOfWeek,
	MIN(d.[Date]) OVER (PARTITION BY d.[year], d.[quarter]) AS FirstDayOfQuarter,
	MAX(d.[Date]) OVER (PARTITION BY d.[year], d.[quarter]) AS LastDayOfQuarter,
	FirstOfYear AS CalendarFirstDayOfYear,
	MAX(d.[Date]) OVER (PARTITION BY d.[year]) AS CalendarLastDayOfYear,
	DATEADD(MONTH, 1, d.FirstOfMonth) AS FirstDayOfNextMonth,
	DATEADD(YEAR, 1, d.FirstOfYear) AS CalendarFirstDayOfNextYear,
	DATEDIFF(DAY, fy.FYStart, d.[Date]) + 1 AS FiscalDayOfYear,
	DATEDIFF(WEEK, fy.FYStart, d.[Date]) + 1 AS FiscalWeekOfYear,
	CASE
		WHEN d.[month] >= 7 THEN d.[month] - 6
		ELSE d.[month] + 6
	END AS FiscalMonth,
	CASE d.[quarter]
		WHEN 1 THEN 3
		WHEN 2 THEN 4
		WHEN 3 THEN 1
		WHEN 4 THEN 2
	END AS FiscalQuarter,
	CONCAT('Q', CASE d.[quarter]
		WHEN 1 THEN 3
		WHEN 2 THEN 4
		WHEN 3 THEN 1
		WHEN 4 THEN 2
	END) AS FiscalQuarterName,
	YEAR(fy.FYStart) AS FiscalYear,
	fy.FYStart AS FiscalFirstDayOfYear,
	MAX(d.[Date]) OVER (PARTITION BY fy.FYStart) AS FiscalLastDayOfYear,
	DATEADD(YEAR, 1, fy.FYStart) AS FiscalFirstDayOfNextYear
FROM Nums n
	CROSS APPLY
	(
		SELECT
			DATEADD(DAY, n - 1, @StartDate) AS [DATE]
	) d0
	CROSS APPLY
	(
		SELECT
			d0.[date],
			DATEPART(DAY, d0.[date]) AS [day],
			DATEPART(MONTH, d0.[date]) AS [month],
			CONVERT(DATE, DATEADD(MONTH, DATEDIFF(MONTH, 0, d0.[date]), 0)) AS FirstOfMonth,
			DATENAME(MONTH, d0.[date]) AS [MonthName],
			DATEPART(WEEK, d0.[date]) AS [week],
			DATEPART(WEEKDAY, d0.[date]) AS [DayOfWeek],
			DATEPART(QUARTER, d0.[date]) AS [quarter],
			DATEPART(YEAR, d0.[date]) AS [year],
			CONVERT(DATE, DATEADD(YEAR, DATEDIFF(YEAR, 0, d0.[date]), 0)) AS FirstOfYear,
			CONVERT(CHAR(8), d0.[date], 112) AS DateKey
	) d
	CROSS APPLY
	(
		SELECT
			--Fiscal year starts July 1.
			FYStart = DATEADD(MONTH, -6, DATEADD(YEAR, DATEDIFF(YEAR, 0, DATEADD(MONTH, 6, d.[date])), 0))
	) fy
	CROSS APPLY
	(
		SELECT
			FYyear = YEAR(fy.FYStart)
	) fyint
WHERE
	n.n <= DATEDIFF(DAY, @StartDate, @EndDate)
ORDER BY
	[date] OPTION (MAXDOP 1);
GO

Second Verse Like the First, but Slower

There’s a bit going on there, so let’s take this a piece at a time.

DECLARE
	@StartDate DATE = '18000101',
	@NumberOfYears INT = 726;

Our august corporation was founded on the first of January in 1800 and will survive in the year 2525. Because that song is a pack of lies I tell you.

Next, I pulled these settings from Aaron’s post. This way region differences won’t show up. You’re in the US now.

--Remove ambiguity with regional settings.
SET DATEFIRST 7;
SET DATEFORMAT mdy;
SET LANGUAGE US_ENGLISH;

The next bit has us build up a CTE-based tally table, a construct useful enough to keep copying. Or one of these days I’ll memorize it.

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)

Now I’m going to do something completely different: start with the FROM clause. Let’s ignore the SELECT portion for a bit.

FROM Nums n
	CROSS APPLY
	(
		SELECT
			DATEADD(DAY, n - 1, @StartDate) AS [DATE]
	) d0
	CROSS APPLY
	(
		SELECT
			d0.[date],
			DATEPART(DAY, d0.[date]) AS [day],
			DATEPART(MONTH, d0.[date]) AS [month],
			CONVERT(DATE, DATEADD(MONTH, DATEDIFF(MONTH, 0, d0.[date]), 0)) AS FirstOfMonth,
			DATENAME(MONTH, d0.[date]) AS [MonthName],
			DATEPART(WEEK, d0.[date]) AS [week],
			DATEPART(WEEKDAY, d0.[date]) AS [DayOfWeek],
			DATEPART(QUARTER, d0.[date]) AS [quarter],
			DATEPART(YEAR, d0.[date]) AS [year],
			CONVERT(DATE, DATEADD(YEAR, DATEDIFF(YEAR, 0, d0.[date]), 0)) AS FirstOfYear,
			CONVERT(CHAR(8), d0.[date], 112) AS DateKey
	) d

In this first block, we convert the numbers into dates starting with our pre-defined start date. Aaron’s post had him inserting into a temp table; I decided to do everything in a single query instead and used CROSS APPLY to hide away some of these repeatedly-copied segments.

Then, we have a couple more calculations to get fiscal year. In this scenario, my fiscal year starts July 1st and runs through June 30th. If your company has a fiscal year which starts in October, change the 6 below to a 10. If your company has a fiscal year which starts in the middle of the month, the story is a bit more complicated but it’s still doable.

	CROSS APPLY
	(
		SELECT
			--Fiscal year starts July 1.
			FYStart = DATEADD(MONTH, -6, DATEADD(YEAR, DATEDIFF(YEAR, 0, DATEADD(MONTH, 6, d.[date])), 0))
	) fy
	CROSS APPLY
	(
		SELECT
			FYyear = YEAR(fy.FYStart)
	) fyint
WHERE
	n.n <= DATEDIFF(DAY, @StartDate, @EndDate)

Now we will take a look at the columns coming back. I’m going to break these into three segments: common, calendar year, and fiscal year. You don’t need a fiscal date and a calendar date—they’re both the same date. The calendar versus fiscal split only affects things which depend upon the starting point of the year, not within a week or month.

First up is common columns:

	CAST(D.DateKey AS INT) AS DateKey,
	D.[DATE] AS [Date],
	CAST(D.[day] AS TINYINT) AS [day],
	CAST(d.[dayofweek] AS TINYINT) AS [DayOfWeek],
	CAST(DATENAME(WEEKDAY, d.[Date]) AS VARCHAR(10)) AS [DayName],
	CAST(CASE WHEN [DayOfWeek] IN (1, 7) THEN 1 ELSE 0 END AS BIT) AS [IsWeekend],
	CAST(ROW_NUMBER() OVER (PARTITION BY d.FirstOfMonth, d.[DayOfWeek] ORDER BY d.[Date]) AS TINYINT) AS DayOfWeekInMonth,
	CAST(DENSE_RANK() OVER (PARTITION BY d.[year], d.[month] ORDER BY d.[week]) AS TINYINT) AS WeekOfMonth,
	CAST(d.[monthname] AS VARCHAR(10)) AS [MonthName],
	d.FirstOfMonth AS FirstDayOfMonth,
	MAX(d.[Date]) OVER (PARTITION BY d.[year], d.[month]) AS LastDayOfMonth,
	MIN(d.[Date]) OVER (PARTITION BY d.[year], d.[week]) AS FirstDayOfWeek,
	MAX(d.[Date]) OVER (PARTITION BY d.[year], d.[week]) AS LastDayOfWeek,
	MIN(d.[Date]) OVER (PARTITION BY d.[year], d.[quarter]) AS FirstDayOfQuarter,
	MAX(d.[Date]) OVER (PARTITION BY d.[year], d.[quarter]) AS LastDayOfQuarter,
	DATEADD(MONTH, 1, d.FirstOfMonth) AS FirstDayOfNextMonth,

Then calendar year-specific items. These are the things that will differ from calendar to fiscal year, such as what day number we’ve reached or what quarter we’re in.

	CAST(DATEPART(DAYOFYEAR, d.[Date]) AS SMALLINT) AS CalendarDayOfYear,
	CAST(d.[week] AS TINYINT) AS CalendarWeekOfYear,
	CAST(d.[month] AS TINYINT) AS CalendarMonth,
	CAST(d.[quarter] AS TINYINT) AS CalendarQuarter,
	CONCAT('Q', d.[quarter]) AS CalendarQuarterName,
	d.[year] AS CalendarYear,
	FirstOfYear AS CalendarFirstDayOfYear,
	MAX(d.[Date]) OVER (PARTITION BY d.[year]) AS CalendarLastDayOfYear,
	DATEADD(YEAR, 1, d.FirstOfYear) AS CalendarFirstDayOfNextYear,

Finally, we have fiscal year details whose columns match the calendar year details.

	DATEDIFF(DAY, fy.FYStart, d.[Date]) + 1 AS FiscalDayOfYear,
	DATEDIFF(WEEK, fy.FYStart, d.[Date]) + 1 AS FiscalWeekOfYear,
	CASE
		WHEN d.[month] >= 7 THEN d.[month] - 6
		ELSE d.[month] + 6
	END AS FiscalMonth,
	CASE d.[quarter]
		WHEN 1 THEN 3
		WHEN 2 THEN 4
		WHEN 3 THEN 1
		WHEN 4 THEN 2
	END AS FiscalQuarter,
	CONCAT('Q', CASE d.[quarter]
		WHEN 1 THEN 3
		WHEN 2 THEN 4
		WHEN 3 THEN 1
		WHEN 4 THEN 2
	END) AS FiscalQuarterName,
	YEAR(fy.FYStart) AS FiscalYear,
	fy.FYStart AS FiscalFirstDayOfYear,
	MAX(d.[Date]) OVER (PARTITION BY fy.FYStart) AS FiscalLastDayOfYear,
	DATEADD(YEAR, 1, fy.FYStart) AS FiscalFirstDayOfNextYear

One thing to note is that I got lazy with the fiscal month and quarter. I probably should have driven from my fiscal start of year date, but because I knew it started in Q3 of the prior calendar year, I could easily do a translation.

I Need More Columns!

If you have a need for more columns, the nice thing about calendar tables is that you can add them. And like I said above, don’t sweat load performance. If you load part of the table and then perform a series of updates off of it, everything’s still fine because you’re only running the scripts once. If you want holidays, Aaron’s post has some or you can check out Daniel Hutmacher’s post.

If you need other columns, just figure out the calculation and extend what you’ve got.

The Inevitable Cliffhanger

At this point, we have a calendar table. In the next post, I’ll show you a few neat tricks you can do with one. Stay tuned!

PolyBase Revealed: Want To Help?

This week’s PolyBase Revealed post is a request. Have you used PolyBase in production? Alternatively, have you tried to implement PolyBase but ended up having to back off for some reason? If so, I’d be interested in your thoughts.

One of the things I want to do is cover as many weird scenarios as I can fit into my allotment of pages. I know I won’t be able to cover everything, but knowing where people have run into trouble can help out future users. And given that we’re looking at several technology stacks interacting, there are certainly some weird scenarios, many of which I wouldn’t even be able to dream of.

If you have wisdom you’d like to share, I’d like to have a chat at a SQL Saturday, via e-mail, or through a series of easily misinterpreted passive-aggressive tweets (or maybe thoughtful DMs if that’s not too much to ask).

Ways to get ahold of me:

  • E-mail feasel [at] my domain.
  • DM me.
  • Say my name three times in the mirror at a SQL Saturday. That won’t cause me to appear, though; I’ll have been at the event the entire time.