Here is a warning to anybody using data exfiltration protection in Azure Synapse Analytics. Hopefully it saves you some valuable time in troubleshooting.
The Setup
I have an Azure Synapse Analytics workspace which uses a managed virtual network and includes data exfiltration protection. I also have a Spark pool. My goal is to import a few packages and use them in a Spark notebook.
Doing so is pretty easy from the Synapse workspace. I navigate to the Manage hub and then choose Apache Spark pools from the Analytics pools menu. From there, I select the ellipsis for my Spark pool and then choose Packages.

From there, because I plan to update Python packages, I can upload a requirements.txt file and have Pip do its job.

Send in my perfectly reasonable requirements.txt file and bam, life is good…or not.
The Error
In the Monitor hub, I can see that one of my Apache Spark applications has failed. Unfortunately, that’s the application which gets packages from Pip and makes them available to my Spark pool.

Reviewing the standard error file, I can see the problem immediately:

For posterity’s sake, here is the error message:
21/12/18 22:18:17 ERROR RawSocketSender: org.fluentd.logger.sender.RawSocketSender
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.fluentd.logger.sender.RawSocketSender.connect(RawSocketSender.java:85)
at org.fluentd.logger.sender.RawSocketSender.reconnect(RawSocketSender.java:94)
at org.fluentd.logger.sender.RawSocketSender.flush(RawSocketSender.java:193)
at org.fluentd.logger.sender.RawSocketSender.send(RawSocketSender.java:184)
at org.fluentd.logger.sender.RawSocketSender.emit(RawSocketSender.java:149)
at org.fluentd.logger.sender.RawSocketSender.emit(RawSocketSender.java:131)
at org.fluentd.logger.sender.RawSocketSender.emit(RawSocketSender.java:126)
at org.fluentd.logger.FluentLogger.log(FluentLogger.java:101)
at org.fluentd.logger.FluentLogger.log(FluentLogger.java:86)
at com.microsoft.mdsdclient.MessageSendingRunnable$1.call(Unknown Source)
at com.microsoft.mdsdclient.MessageSendingRunnable$1.call(Unknown Source)
at com.microsoft.mdsdclient.RetryUtil.retry(Unknown Source)
at com.microsoft.mdsdclient.RetryUtil.retry(Unknown Source)
at com.microsoft.mdsdclient.MessageSendingRunnable.sendMessage(Unknown Source)
at com.microsoft.mdsdclient.MessageSendingRunnable.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)
That RawSocketSender error means that the Spark driver was not able to open a connection to the outside world.
The Answer
In case you have already figured it out, congratulations: you were about 20 minutes ahead of me. It turns out that there’s a little message in the Synapse docs about using libraries in a Spark pool:
Installing packages from external repositories like PyPI, Conda-Forge, or the default Conda channels is not supported within data exfiltration protection enabled workspaces.
In retrospect, it all makes sense. The way data exfiltration protection works is that it prevents outbound internet traffic, instead only allowing services to communicate over managed private endpoints. The PyPI and Conda servers are definitely not managed private endpoints, and so those commands will fail.
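If you want to see this for yourself before digging through logs, you can try opening a connection by hand from a notebook cell. This is only a minimal sketch and not something out of the Synapse docs: `can_reach` is a helper name I made up, and it assumes that pypi.org on port 443 is a fair stand-in for what Pip tries to contact.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts, and DNS failures all land here.
        return False


# Example call (assumption: pypi.org:443 represents what Pip needs):
# can_reach("pypi.org")
```

In a workspace with data exfiltration protection turned on, I would expect that call to come back False, though the exact failure (refused, timed out, or unresolved name) may differ depending on how the managed virtual network handles the traffic.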
What we need to do in this case is to upload the packages ourselves. There are a couple of ways to do this. If you only have one or two packages, you can easily use the GUI to install wheel files. If you have a bunch of packages, I’d recommend creating a custom Conda channel. The idea there is that you grab all of the packages you need and write them into a storage account. As long as the Synapse workspace has a managed private endpoint into that storage account, you can create a file which points to your storage account as a custom channel, and package management should then work for you.
That’s a lot more effort, but if you have data exfiltration protection on, you’re probably hosting sensitive data, and knowing that malicious Python packages won’t be able to dial out and exfiltrate that data makes the exercise worthwhile.
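Whichever route you take, the package files have to come from somewhere, so one option is to collect them on a machine that does have internet access and check that the set is complete before uploading anything. Here is a rough sketch under a few assumptions: every line in requirements.txt is pinned as `name==version` with no extras or environment markers, the wheels sit in one local folder, and the helper names (`pip_download_command`, `parse_pins`, `missing_wheels`) are my own rather than anything Synapse provides.

```python
import re
import sys
from typing import Dict, List


def normalize(name: str) -> str:
    """Normalize a project name so 'Foo.Bar', 'foo_bar' and 'foo-bar' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def pip_download_command(requirements: str, dest: str) -> List[str]:
    """Build a 'pip download' command that fetches wheels only (no source archives)."""
    return [sys.executable, "-m", "pip", "download",
            "-r", requirements, "-d", dest, "--only-binary=:all:"]


def parse_pins(text: str) -> Dict[str, str]:
    """Read 'name==version' lines, skipping blank lines and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"unpinned requirement: {line!r}")
        pins[normalize(name.strip())] = version.strip()
    return pins


def missing_wheels(pins: Dict[str, str], wheel_names: List[str]) -> List[str]:
    """Return the pinned packages that have no matching wheel file name."""
    have = set()
    for filename in wheel_names:
        if filename.endswith(".whl"):
            # Wheel file names start with '<distribution>-<version>-'.
            parts = filename[:-4].split("-")
            if len(parts) >= 5:
                have.add((normalize(parts[0]), parts[1]))
    return sorted(f"{n}=={v}" for n, v in pins.items() if (n, v) not in have)
```

You would run the command from `pip_download_command` with something like `subprocess.run(cmd, check=True)` on the connected machine, then pass the folder's file names to `missing_wheels` to see what still needs fetching. Two caveats: the version match is a plain string comparison, so `1.0` and `1.0.0` count as different, and the wheels Pip picks depend on the machine doing the download, so they may not match the Spark pool's platform.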