I have a new EC-Council course available via their CodeRed platform. The name of the course is Securing a Data Warehouse in Azure Synapse Analytics and it provides a rather detailed overview of security functionality in the product.
The impetus behind this course is that Azure Synapse Analytics is a huge platform, and when it comes to security, you need to think about:
- Who gets what access to the underlying resources (like data lake storage)?
- Who gets what access to the Synapse workspace?
- Who can create and destroy dedicated SQL pools and Spark pools?
- Who gets what access to data in dedicated SQL pools, Spark pools, and the serverless SQL pool?
- How do we prevent unauthorized access to resources?
- How do we deploy resources without the general public being able to access them?
- How do we securely collect information from external sources like Azure SQL Database, Cosmos DB, and on-premises servers?
- How do we ensure business continuity: disaster recovery and high availability?
That’s page one of 50 for a security review… My goal in this course is to provide an 80-40 solution for Azure Synapse Analytics security.
An 80-40 Solution?
There are two key variables in creating a training program: breadth and depth. There are also implicit and explicit constraints on solutions. As an example of an explicit constraint, the EC-Council requested that I not record a video longer than 15 minutes, as they’ve found that to be about the maximum length before a person loses interest. Certain groups also have constraints on how they’d like a program to be structured: how many videos go into a module, how modules themselves are organized (e.g., two lessons per chapter or three modules per section), and the overall scope of a course.
When it comes to implicit constraints, these aren’t things you necessarily say out loud, but they’re lurking behind the scenes. For example, a training course will often be 4-8 hours long, as the time and energy needed to create a 50-hour course far exceeds the return on investment in most cases. Similarly, we tend to cover the most significant cases in a training course, only touching on edge cases when they’re particularly relevant. That can be frustrating if you get caught in an edge case, but the instructor may not know about, or be able to fit in, every possible failure scenario and how to extricate yourself from it.
With those constraints in mind, we have to decide just how much breadth and how much depth we want to cover in the course. There is a natural trade-off between breadth and depth, and choosing to focus on one comes at the exclusion of the other. This is a necessary fact of life, as we rarely encounter 100-100 solutions, in which we cover the entire breadth of a topic as well as all of its possible depth. That’s because we typically have a fixed amount of time, such as 4 hours for an online course or one college semester meeting for an hour twice a week. And even if you think a course exhausts a topic, there are “neighboring” topics of interest. When it comes to security on a platform, even if you enumerate every possible setting of an application, you still have external interfaces, interaction with user access controls, networking, backup and recovery, and so much more.
When we develop content, we consciously decide whether it makes more sense to dive into a given topic or include another topic. We try to make the best decision possible given expectations, hitting on the thing we think will be most useful to the audience. That estimation may turn out to be incorrect, but it’s the method to the madness.
With this in mind, I typically classify things in one of a few buckets (with the numbers being representative of reality but not necessarily accurate gauges):
- 90-20 = 90% breadth, 20% depth. Think of this as an introductory course on a broad topic. I want you to know about pretty much everything the topic has to offer, but we won’t get into much detail. In the Azure Synapse Analytics space, that’s a training on what Azure Synapse Analytics has to offer. It’d probably take about 2 hours to get to 85-90% and 3-4 hours to get to 95% breadth. 20% depth means we tend to stay pretty shallow, describing what things do and demonstrating some of the highlights, but not really getting into a topic. Another example of this is my course The Curated Data Platform, which is just under 2 hours long and gives you an understanding of why there are so many data platform technologies and what they’re trying to do.
- 80-40. This is going to give you a lot of information and will dive in to cover many general scenarios, as well as some specific scenarios. At the same time, we purposefully exclude some content. For example, in the training, I explicitly exclude Data Explorer pools. A big part of the reason I exclude them is that they’re a preview feature (as of right now) and I don’t know what security will look like for them once they’re finished. But there are some other features I excluded from the training simply because the course was already about 40-50% longer than “ideal.” [Putting on my marketing hat: That’s right, sports fans, you get 40-50% additional free content with the course!]
- 20-80. In this case, we really narrow in on a topic. Advanced college courses get into the 20-80 space, moving from names like Introduction to Artificial Intelligence to Research in Convolutional Neural Networks. We purposefully ignore a lot of breadth and instead focus in on the details. Even so, depending on how large the topic is, we still don’t cover all of the edge cases.
- 5-90. Here, we get into some very tailored discussion. An example of this is my APPLY Yourself training, in which I spend 90+ minutes on CROSS APPLY and OUTER APPLY, which are themselves a tiny part of the whole T-SQL space.
- 1-95. This is the space for an academic thesis for a Master’s or a PhD. In most cases, people won’t enjoy this content unless you’re covering something very specific to their niche.
Note that 50-50 type courses aren’t on the list, and that’s on purpose. I think 50-50 courses tend to leave a lot of people unsatisfied. On one side, they’re missing so much breadth that people looking for an overview will be disappointed, and people who are already knowledgeable in the field will think something is amiss when they see you haven’t covered their particular pet topics. On the other side, people going in expecting a lot of depth will be disappointed as well, as they get just enough to whet their appetite but not enough to make the course worthwhile.
What you can do is have multiple trainings that supplement one another. For example, a 90-20 or 80-40 training might give a high-level overview of a system, and then a few 20-80 courses may be available for people who want to dive deeply into a particular topic. Also, note that these numbers are approximations and aggregates, so on any given topic, you have the opportunity to go deeper. In fact, except for the highest-level overview trainings, you definitely want to go a little deeper on some of the most important or foundational concepts. If I were putting together a 90-20 or 80-40 performance tuning training, I would of course spend a lot more time on indexes than on the merry-go-round scan, as the former will be a lot more impactful to the average DBA’s life than the latter (even if you meet the specific requirements for merry-go-round scans and find them to be a life-saver!). So don’t think of 80-40 as saying, “I need to cut this off because I have spent exactly 3.6 minutes on this topic and I must spend 3.6 minutes on each topic.” There’s a lot more flexibility than that.
So What Does My 80-40 Solution Mean Here?
Tying that quasi-philosophical bit back to the new course, I want to describe what you get with Securing a Data Warehouse in Azure Synapse Analytics.
- Chapter 1 provides an introduction to the topic, giving you an idea of what Azure Synapse Analytics is, providing an overview of the business case we’ll handle throughout the training, and letting you meet the various personas. An important aspect of security is coming up with personas: representations of people who engage in specific roles and have certain personality traits and needs.
- In Chapter 2, we deploy all of the prerequisites for the training. This includes existing customer data in Azure SQL Database and Cosmos DB. We then plan out an architecture for how we will handle loading and processing data in Azure Synapse Analytics. I wrap up the chapter with discussions around regulatory compliance in Azure (spoiler: it’s a joint effort) and business continuity.
- Chapter 3 has us build a Synapse workspace. Because this is a course on security, we need to turn on (almost) all of the security options, including managed virtual networks and data exfiltration protection. I then spend some time talking about the networking side of things, as it’s easy to forget when dealing with a platform-as-a-service offering like Azure Synapse Analytics. I close the chapter with permissions and Synapse workspace role-based access control.
- Chapter 4 focuses on working in dedicated SQL pools and on the security functionality built into the dedicated SQL pool itself. This includes things like transparent data encryption, row-level security, column-level security, role-based access controls (logins, users, and roles), and data discovery & classification opportunities. (There’s a short T-SQL sketch of row-level and column-level security after this list.)
- Then, we move on to Chapter 5, which covers the serverless SQL pool. The first half of the chapter looks at how to work with the serverless SQL pool, both with OPENROWSET() and PolyBase. The second half covers proper use of SAS tokens for accessing remote storage and a way of implementing poor man’s row-level security. (There’s a sketch of the serverless SQL pool side after this list as well.)
- Chapter 6 follows up on this and looks at Spark pools, particularly via Synapse notebooks. I also take a look at Lake Databases and see how we can access data from Spark or the serverless SQL pool. We get a better understanding of what data exfiltration protection prevents us from doing, work with storage account permissions, and even take data from Cosmos DB and move it into our dedicated SQL pool via a Spark pool.
- Chapter 7 looks at Synapse pipelines. My goal in this chapter is to provide a brief overview of what pipelines do, show how they work, and also introduce Integration Runtimes as a concept.
- Finally, Chapter 8 includes information on a few additional tools we didn’t necessarily cover in detail. I start with Microsoft Defender for SQL (assuming they haven’t renamed it this week), which provides Advanced Threat Protection and vulnerability assessments. We look through some of what it has to offer, and then take a look at Azure Policy and some freely available Synapse rules. I wrap up the course with a more detailed discussion on data exfiltration, show how you can use customer-managed keys to double-encrypt your data, and give a couple of hints on where to go next if you want to get beyond 80% breadth.
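Because the dedicated SQL pool chapter leans so heavily on features like row-level and column-level security, here is a minimal T-SQL sketch of what those two features look like in a dedicated SQL pool. This is not code from the course: the table, schema, user, and column names (dbo.Sale, Security.fn_SaleFilter, SalesRep1, CustomerEmail) are made up purely for illustration.

```sql
-- A user without a login, standing in for a sales rep persona.
-- (In a real workspace you would more likely use CREATE USER ... FROM EXTERNAL PROVIDER
-- to map an Azure AD identity; WITHOUT LOGIN keeps the sketch self-contained.)
CREATE USER SalesRep1 WITHOUT LOGIN;
GO

-- Illustrative sales table, keyed to the rep who owns each row.
CREATE TABLE dbo.Sale
(
    SaleID int NOT NULL,
    SaleDate date NOT NULL,
    Amount decimal(19, 4) NOT NULL,
    CustomerEmail nvarchar(256) NOT NULL,
    SalesRepName nvarchar(128) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);
GO

-- Column-level security: grant access only to the columns this persona should see.
-- CustomerEmail is deliberately left out of the grant.
GRANT SELECT ON dbo.Sale (SaleID, SaleDate, Amount, SalesRepName) TO SalesRep1;
GO

CREATE SCHEMA Security;
GO

-- Row-level security: an inline function that returns a row only when
-- the row's owner matches the current database user.
CREATE FUNCTION Security.fn_SaleFilter(@SalesRepName AS nvarchar(128))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_result
       WHERE @SalesRepName = USER_NAME();
GO

-- Bind the predicate to the table; SalesRep1 now sees only their own rows.
CREATE SECURITY POLICY Security.SaleFilter
    ADD FILTER PREDICATE Security.fn_SaleFilter(SalesRepName) ON dbo.Sale
    WITH (STATE = ON);
GO
```

The filtering mechanics are the same whether you use login-less users as above or Azure AD users created with FROM EXTERNAL PROVIDER; the latter is the more realistic choice in a Synapse workspace.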
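And here is a similarly hedged sketch of the serverless SQL pool side of things: querying Parquet files in the data lake through OPENROWSET(), with a database-scoped credential wrapping a SAS token. The storage account, container, folder path, credential name, and SAS value below are placeholders rather than anything from the course.

```sql
-- Run this in a user database on the serverless SQL pool (not master).
-- A master key is required before creating database-scoped credentials.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Strong_Password_Here>';
GO

-- A credential wrapping a SAS token (placeholder secret) for the storage account.
CREATE DATABASE SCOPED CREDENTIAL SalesDataLakeSas
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = 'sv=2022-11-02&ss=b&srt=co&sp=rl&sig=<placeholder>';
GO

-- An external data source pointing at a data lake container (placeholder URL).
CREATE EXTERNAL DATA SOURCE SalesDataLake
WITH (
    LOCATION = 'https://mydatalake.dfs.core.windows.net/sales',
    CREDENTIAL = SalesDataLakeSas
);
GO

-- Query Parquet files through OPENROWSET() using that data source and credential.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'curated/sales/*.parquet',
    DATA_SOURCE = 'SalesDataLake',
    FORMAT = 'PARQUET'
) AS sales;
```

Scoping separate credentials and data sources to separate folders per group of users is one way to get at the “poor man’s row-level security” idea mentioned above, though the specific approach in the course may differ.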
There is a lot in this course. If you are a security specialist, database administrator, or developer concerned with getting security right on a complex platform, then (and I may be biased in this regard) I think this turned out to be a really good course.