After the first Polybase statistics post, we left a bit defeated. I want to find an efficient query which doesn’t pull all of the rows back into SQL Server, thereby defeating one of the major benefits of Polybase: that I can move large amounts of data to Hadoop and use that relatively less expensive cluster to do a lot of the heavy lifting. Unfortunately, we left stymied. But like Rocky, we’re going to train harder, punch some slabs of meat, and have ourselves a montage. Consider this part of the training montage.
This is a short post today, but it’s a question that I was asked at a recent SQL Saturday when I gave my talk on integrating SQL Server with Hadoop: where do the stats for Polybase actually live? I didn’t know the answer at the time, so I promised I’d figured it out. Because it ties into our training so well, it seems like a great topic for today.
Where Do They Live?
As a quick refresher, we create stats on an external table using the CREATE STATISTICS command:
CREATE STATISTICS [Flights_Dest] ON [dbo].[Flights]([dest])
There exists a catalog view, sys.stats, which allows us to see certain information regarding a statistic. Here’s a sample query for our Flights_Dest statistic:
SELECT * FROM sys.stats WHERE object_id = OBJECT_ID('dbo.Flights') AND name = 'Flights_Dest';
If we run the execution plan for Flights_Dest against a different stat (such as the IX_Airports_State_City index I created), we’ll see that the query plans are exactly the same:
This says to me that Polybase statistics are stored on disk in the same place as regular statistics (and if you’re curious, Aaron Bertrand dug into that in some detail).
What You Can And Cannot Do With Stats Queries
One of the first things you might want to do is run DBCC SHOW_STATISTICS, and you get the results back that you’d expect:
DBCC SHOW_STATISTICS('dbo.Flights', 'Flights_Dest')
Next up, I want to see if the sys.stats_columns metadata table has my column information.
SELECT * FROM sys.stats_columns WHERE object_id = OBJECT_ID('dbo.Flights');
Looks like this is another successful result. I could join to sys.columns and learn that this statistic is on the dest column, and join to sys.stats to learn that its name is Flights_Dest.
Next up is a Dynamic Management Function, sys.dm_db_stats_properties.
SELECT * FROM sys.dm_db_stats_properties(OBJECT_ID('dbo.Flights'), 2); SELECT * FROM sys.dm_db_stats_properties(OBJECT_ID('dbo.airports'), 2);
This time, I included results for a different query as well, because we get no results back for this Polybase statistic. The MSDN page for this DMF does not specify either way whether we’d expect Polybase statistics to show up here, but my expectation was that they would. Here are the criteria that MSDN says could lead to an empty row:
sys.dm_db_stats_properties returns an empty rowset under any of the following conditions:
- object_id or stats_id is NULL.
- The specified object is not found or does not correspond to a table or indexed view.
- The specified statistics ID does not correspond to existing statistics for the specified object ID.
- The current user does not have permissions to view the statistics object.
Well, we definitely have values for object_id and stats_id (I also tried using the actual numeric values instead of calling the OBJECT_ID function). The object shows up in sys.tables and is, for our purposes, a valid table. The stats ID definitely shows up; I hard-coded the value based on what I saw in sys.stats. Finally, I am running as sysadmin and can see other statistics, so I couldn’t see a permissions issue being the problem.
Today, we learned that Polybase statistics are stored in the same way as other statistics; as far as SQL Server is concerned, they’re just more statistics built from a table (remembering that the way stats get created involves loading data into a temp table and building stats off of that temp table). We can do most of what you’d expect with these stats, but beware calling sys.dm_db_stats_properties() on Polybase stats, as they may not show up.
Also, remember that you cannot maintain, auto-create, auto-update, or otherwise modify these stats. The only way to modify Polybase stats is to drop and re-create them, and if you’re dealing with a large enough table, you might want to take a sample.