I ran into a really frustrating error around using Key Vault in Spark. In this post, I’ll cover the issue, what causes it, and how you can correct it.
The Issue: A Misleading Timeout
I want to use Azure Synapse Analytics with data exfiltration protection enabled. One of the main consequences of data exfiltration protection is that Spark pools cannot connect to outbound servers over the public internet, whether those servers are in Azure or not.
I wanted to retrieve a secret from Azure Key Vault via a Spark notebook, and so I used a pretty simple command: TokenLibrary.getSecretWithLS("AzureKeyVault1", "SecretPassword"). This is a command to get a secret from an Azure Key Vault set up as a linked service. The linked service is named AzureKeyVault1 and I know that I have a secret in the Key Vault named SecretPassword. When running this, it spins for a minute and spits out an error message:
ERROR TokenLibraryLinkedService: POST failed com.twitter.util.TimeoutException: 1.minutes
    at com.twitter.util.Future.$anonfun$within$1(Future.scala:1642)
    at com.twitter.util.Future$$anon$4.apply$mcV$sp(Future.scala:1693)
    at com.twitter.util.Monitor.apply(Monitor.scala:46)
    at com.twitter.util.Monitor.apply$(Monitor.scala:41)
    at com.twitter.util.NullMonitor$.apply(Monitor.scala:229)
    at com.twitter.util.Timer.$anonfun$schedule$2(Timer.scala:39)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.twitter.util.Local$.let(Local.scala:4904)
    at com.twitter.util.Timer.$anonfun$schedule$1(Timer.scala:39)
    at com.twitter.util.Monitor.apply(Monitor.scala:46)
    at com.twitter.util.Monitor.apply$(Monitor.scala:41)
    at com.twitter.util.NullMonitor$.apply(Monitor.scala:229)
    at com.twitter.util.Timer.$anonfun$schedule$2(Timer.scala:39)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.twitter.util.Local$.let(Local.scala:4904)
    at com.twitter.util.Timer.$anonfun$schedule$1(Timer.scala:39)
    at com.twitter.util.JavaTimer.$anonfun$scheduleOnce$1(Timer.scala:233)
    at com.twitter.util.JavaTimer$$anon$3.run(Timer.scala:264)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
Based on this error message, the command is timing out after one minute. My initial thought was that there must be a network restriction preventing me from communicating with Key Vault, and I knew what my first answer would be.
Managed Private Endpoints
The only way Spark pools can communicate with any other machines when we have data exfiltration protection enabled is via managed private endpoints, which guarantee that network transit remains inside Azure.
In order to support this Key Vault access, I had already created a managed private endpoint to my Azure Key Vault:
I had also created a linked service to Azure Key Vault:
The Red Herring
Something stuck out to me here during my troubleshooting: the “Using private endpoint” column is empty. Furthermore, I found it odd that the linked service dialog for Key Vault differed from those of other services. For example, here is the linked service screen for Cosmos DB:
Meanwhile, Key Vault offers no such indicator:
I call this a red herring because it turns out that Key Vault does use the managed private endpoint, despite there being no indication of this in the Synapse workspace. The error message from above has nothing to do with managed private endpoint issues, even though that appeared to be the most obvious answer, given that I had run into other data exfiltration protection issues whose solution was “You can’t go outbound to the public internet.”
The Real Answer: Permissions
It turns out that the connection timeout error was masking the actual problem. The way I figured out the issue was by trying to run Test Connection on my Key Vault linked service. The test succeeded when connecting to the Key Vault itself, which meant that there wasn’t a network problem. But when I tried to test retrieval of SecretPassword, I ended up getting an error message indicating that I had not set up a Key Vault access policy.
When setting up a Key Vault access policy, you’ll need to include Get and List for whatever assets you need: keys, secrets, and certificates. If you want to allow write-based actions, you would of course need to grant sufficient rights here as well. You’ll also need to specify the Synapse Workspace principal. One important thing here is that you do not want to add an authorized application. Although you can find the Synapse application object in Azure Active Directory, do not include it in the access policy. Otherwise, you will get error messages like:
Microsoft.Azure.KeyVault.Models.KeyVaultErrorException: The policy requires the caller [...] to use on-behalf-of (OBO) flow. For help resolving this issue, please see https://go.microsoft.com/fwlink/?linkid=2125287
Here is a Q&A which explains a little more about the problem. And here is what my access policy looks like:
Once you create the access policy, Test Connection in the linked service configuration menu should work as expected:
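If you would rather script the access policy than click through the portal, here is a minimal Python sketch that assembles an Azure CLI call to az keyvault set-policy. This is not what I ran for this post. The vault name and object ID are hypothetical placeholders, and it assumes the vault uses the access policy permission model and that you have looked up the object ID of the Synapse workspace principal yourself. It grants only Get and List on secrets, and it deliberately leaves out --application-id, in line with the warning above about authorized applications.

```python
import shlex
from typing import List


def build_set_policy_command(vault_name: str, workspace_object_id: str) -> List[str]:
    """Assemble an `az keyvault set-policy` call granting Get and List on secrets.

    Both arguments are placeholders to fill in for your own environment.
    No --application-id is passed, so no authorized application is attached.
    """
    return [
        "az", "keyvault", "set-policy",
        "--name", vault_name,
        "--object-id", workspace_object_id,
        "--secret-permissions", "get", "list",
    ]


# Hypothetical values, for illustration only.
command = build_set_policy_command(
    "my-key-vault", "00000000-0000-0000-0000-000000000000"
)
print(shlex.join(command))
```

To actually apply the policy, you could pass the list to subprocess.run(command, check=True) from a machine that is logged in to the Azure CLI. If you also need keys or certificates, --key-permissions and --certificate-permissions follow the same pattern.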
And that’s the end of that. Unless it’s not.
Another Permissions Error
After doing this for my account, I tried with a separate account with lesser permissions. My Synapse Administrator account worked fine, but this account was still getting timeouts. When I checked the linked service test, I got a separate error:
When calling the TokenLibrary command, this user gets a generic connection timeout. When trying to test the connection, we get a real error message. This error message is actually quite straightforward: the user account does not have the useSecret permission. It turns out that /workspaces/credentials/useSecret/action is only available to a couple Synapse RBAC roles: Synapse Administrator and Synapse Credential User. The less-privileged user account was a Compute Operator but did not have Credential User. Adding Credential User allowed me to retrieve secrets as this user.
When TokenLibrary fails in your Azure Synapse Analytics Spark pool, the connection timeout may actually be due to insufficient Key Vault privileges. Have the user test the linked service connection, both in connecting to the linked service and in retrieving a specific secret. You might have an issue with data exfiltration protection and no managed private endpoint, but if you do have a managed private endpoint set up, the most likely culprit is insufficient permissions, and the timeout is hiding the real issue.
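To keep from chasing the wrong problem again, one option is to wrap the secret lookup so that a timeout points back at this checklist. What follows is a rough Python sketch, not an official pattern. The function names and the checklist wording are mine. It assumes the notebook exposes mssparkutils.credentials.getSecretWithLS as the Python counterpart of the TokenLibrary call, and that the raised exception's text contains TimeoutException the way the log output earlier in this post does. Neither assumption is guaranteed to hold in your environment.

```python
from typing import Callable

# Checklist distilled from this post; the wording is mine, not a Synapse message.
CHECKLIST = (
    "Secret lookup timed out. Before blaming the network, run Test Connection "
    "on the Key Vault linked service (both the vault and a specific secret) "
    "and check: (1) a managed private endpoint to the Key Vault exists; "
    "(2) the Key Vault access policy grants Get and List to the Synapse "
    "workspace principal, with no authorized application; "
    "(3) the calling user is a Synapse Administrator or Synapse Credential User."
)


def fetch_from_synapse(linked_service: str, secret_name: str) -> str:
    """Default fetcher. Assumption: this runs inside a Synapse notebook."""
    from notebookutils import mssparkutils  # importable only in Synapse

    return mssparkutils.credentials.getSecretWithLS(linked_service, secret_name)


def get_secret_with_hints(
    linked_service: str,
    secret_name: str,
    fetch: Callable[[str, str], str] = fetch_from_synapse,
) -> str:
    """Fetch a secret, turning a bare timeout into a pointer to the checklist."""
    try:
        return fetch(linked_service, secret_name)
    except Exception as exc:
        # Assumption: the failure text mentions TimeoutException, as the log does.
        if "TimeoutException" in str(exc):
            raise RuntimeError(CHECKLIST) from exc
        raise  # anything else passes through untouched
```

In a notebook, this would be called as get_secret_with_hints("AzureKeyVault1", "SecretPassword"). The fetch parameter exists only so the wrapper can be exercised outside Synapse with a stand-in function.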
Coda: The Secret Password
You probably were wondering what the secret password was.