Today, we are announcing the retirement of PolyBase scale-out groups in Microsoft SQL Server. Scale-out group functionality will be removed from the product in SQL Server 2022. In-market SQL Server 2019, 2017, and 2016 will continue to support the functionality to the end of support for those products.
Okay, not dead-dead. But let’s talk about what this means.
PolyBase to Hadoop Really Is Dead
We could see the writing on the wall here ever since Cloudera and Hortonworks merged. Cloudera Distribution of Hadoop (CDH) and Hortonworks Data Platform (HDP) were both on-premises offerings that you could also get in the cloud. Post-merger, Cloudera Data Platform (CDP) was cloud-only and, to my knowledge, they have never released an on-premises version. Cloud versus on-premises isn’t itself the issue, but it does tie in with the issue: for PolyBase to work, certain ports need to be exposed on your Hadoop cluster. Cloud offerings tend not to want to expose a bunch of internal service ports to the outside world, so PolyBase to CDP was a non-starter.
This also means that the PolyBase requirement for Java is gone. In the installer, I assume the “Java connector for HDFS data sources” is a goner.
PolyBase to Azure Blob Storage Lives
With SQL Server 2022, instead of using the Java-based HDFS driver, we will use REST APIs. This marks the third conceptual pattern in the PolyBase world in SQL Server:
- The first pattern, originally implemented in Parallel Data Warehouse and APS, uses Java to connect to HDFS. Because Azure Blob Storage happens to have a WebHDFS interface, this pattern also worked against Blob Storage.
- The second pattern uses ODBC drivers to connect to remote data sources. This opened up the door to a variety of data platform technologies.
- The third pattern uses REST APIs to retrieve structured data. Interestingly, this will still include a WebHDFS variant, so we can get data from Hadoop clusters even in CDP.
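As a rough way to keep the three patterns straight, here is a minimal Python sketch that sorts an external data source location into a pattern by its URI prefix. The prefix lists are my assumption, based on the location prefixes documented for CREATE EXTERNAL DATA SOURCE across SQL Server 2016, 2019, and 2022. The function name `connector_pattern` and the pattern labels are hypothetical, made up for illustration, and are not anything SQL Server exposes.

```python
# Assumed mapping from location prefix to PolyBase connector pattern.
# The labels ("java-hdfs", "odbc", "rest") are illustrative names only.
PATTERN_BY_PREFIX = {
    # Pattern 1: Java-based HDFS driver (SQL Server 2016 through 2019)
    "hdfs": "java-hdfs",
    "wasb": "java-hdfs",
    "wasbs": "java-hdfs",
    # Pattern 2: ODBC drivers (SQL Server 2019)
    "sqlserver": "odbc",
    "oracle": "odbc",
    "teradata": "odbc",
    "mongodb": "odbc",
    "odbc": "odbc",
    # Pattern 3: REST APIs (SQL Server 2022)
    "abs": "rest",
    "adls": "rest",
    "s3": "rest",
}


def connector_pattern(location: str) -> str:
    """Return the assumed connector pattern for a location string."""
    prefix, separator, _ = location.partition("://")
    if not separator:
        raise ValueError(f"not a prefixed location: {location!r}")
    # Unrecognized prefixes fall through to "unknown" rather than guessing.
    return PATTERN_BY_PREFIX.get(prefix.lower(), "unknown")


if __name__ == "__main__":
    for loc in (
        "hdfs://namenode:8020",
        "sqlserver://sqlhost",
        "abs://container@account.blob.core.windows.net",
    ):
        print(loc, "->", connector_pattern(loc))
```

The practical takeaway is that moving from the first pattern to the third mostly shows up as a different location prefix on the external data source, while the ODBC pattern sits untouched in the middle.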
Scale-Out Groups Are Dead
The other key implication here is that scale-out groups are no more. If you weren’t familiar with them, check out my installation post from a while back. This was a first go at massively parallel processing within SQL Server itself, allowing you to link several SQL Server installations together to perform processing on external data. The original use case was to allow PolyBase to interact directly with data nodes, but I found some benefits with scale-out groups in the ODBC pattern (what I like to call V2) released in SQL Server 2019.
As an ideal, I was all about scale-out groups because there are certain classes of problem that scale-out can solve better than scale-up. That’s why we have dedicated SQL pools in Azure Synapse Analytics, for example. In an ideal world, scale-out groups would be the precursor for a full-blown MPP solution in the SQL Server space. That dream is gone. Instead, it’s all scale-up all the time.
In fairness, scale-out groups are complex to the point where very few demonstrators would even use them. I stopped using them in my demos, though I did have a whole section on how they were set up in the book.
I led off by jumping to an extreme. In reality, PolyBase is still around and it’s still quite useful as a data virtualization technology. The addition of REST API connectors to PolyBase looks to be quite promising; once we have a public build which includes them, I’ll have some fun checking them out.
My hope is that, by removing Java from PolyBase altogether, more companies will adopt the product. Whether that happens remains to be seen.