UPDATED 2018-03-13: SQL Server 2017 CU4 fixes this issue. See below.
We call SQL Server ML Services a lot. As in building hundreds of thousands of times a day to build models. It turns out that doing this has a negative effect: ML Services plans end up staying in the plan cache and don’t get removed. Here’s how our plan cache looks:
What happens is that things work fine for a while, until our plan cache hits about 70 GB, after which point we start getting RESOURCE_SEMAPHORE waits on some of our queries and the available space for buffer pool drops to single-digit gigabytes.
This is a problem on SQL Server 2016 and SQL Server 2017. It’s very unlikely to affect most people, as most people don’t do crazy stuff at this scale. But hey, what’s the fun in having a server of my own if I can’t bring it to its knees every once in a while?
The first thing that you might do here is try to run something like
DBCC FREEPROCCACHE or maybe
DBCC FREESYSTEMCACHE('SQL Plans') WITH MARK_IN_USE_FOR_REMOVAL; but neither of those did anything. It appears that R/ML Services plans are not marked for removal and will not clear, no matter how many times you try to flush the cache.
For now, the workaround I have is to restart the SQL Server service occasionally. You can see that I have done it twice in the above screenshot. Our application is resilient to short database downtimes, so this isn’t a bad workaround for us; it’s just a little bit of an annoyance.
One thing to keep in mind if you are in this scenario is that if you are running ML Services hundreds of thousands of times a day, your ExtensibilityData folders might have a lot of cruft which may prevent the Launchpad service from starting as expected. I’ve had to delete all folders in
\MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData\MSSQLSERVER01 after stopping the SQL Server service and before restarting it. The Launchpad service automatically does this, but if you have a huge number of folders in there, the service can time out trying to delete all of them. In my experience at least, the other folders didn’t have enough sub-folders inside to make it worth deleting, but that may just be an artifact of how we use ML Services.
I have worked with Microsoft on the issue and they’re going to release a patch in a future SQL Server 2017 CU to fix this issue. I’m not sure about SQL Server 2016 and also don’t know exactly when this patch will ship, but it’s working through the pipeline and I’m happy for that.
Microsoft has released SQL Server 2017 CU4, which fixes this buffer pool issue. After the patch, my SQL plan cache has not grown beyond 2 GB after 4 days, whereas prior to the patch, it’d be in the 50-60 GB range by then.