(Audio)Book Review: The Ethics of Aristotle

It’s been a minute since I’ve done a book review, and I’ve never done an audiobook review, though as I write this, I’m reminded of the Space Ghost: Coast to Coast episode entitled Dam, in which Space Ghost interviews Charlton Heston:

Charlton Heston: Uh, you’re, you seem perfectly fluent in English, can you read?

Zorak: No.

Space Ghost: I like books on tape.

Charlton Heston: Oh, no no no no, we can do better than that, what about Shakespeare?

Space Ghost: What about books on tape?

Zorak: No.

Charlton Heston: No, nope. Shakespeare, that’s the best of them all. You know Shakespeare.

Zorak: Nope.

Space Ghost: Not personally.

Charlton Heston: No… (puts his hand to his face) You know the writings of Shakespeare.

Space Ghost: We didn’t have the theatre when I grew up, Chuck. We had hard work. Long days, mending the nets. Scaling the fish. No part of the fish was wasted, Chuck. We used the entire fish.

I think that puts us in the right context to appreciate Father Joseph Koterski’s course entitled The Ethics of Aristotle. Strictly speaking, this is a set of lectures rather than a book, but it’s on Audible and this is my blog and these are my rules.

The Ethics of Aristotle was a very interesting listen, providing a lot of detail on topics such as moral excellence, following the classical virtues (courage, temperance, wisdom, and honor), concepts of justice, and friendship as Aristotle describes them in his Nicomachean Ethics. Ultimately, Aristotle believes that the good life is a life of happiness, and it is our duty to understand what provides happiness. But Aristotle is careful to separate his notion of happiness (or the good life) from a Benthamite life of pleasure—happiness is not, strictly speaking, wealth or pleasure, but rather doing the right thing for the right reasons.

Speaking of friendship, Father Koterski spends two of his twelve lectures on it, noting just how much time Aristotle himself devotes to the subject. We get to learn about Aristotle’s classification of the various sorts of friendship and gain some understanding of why some friendships fail where others succeed.

In the classic Plato-Aristotle throw-down, I’m definitely going to side with Aristotle over Plato (though I do believe Karl Popper’s statement that the level of advancement in various sciences is proportional to the degree to which they’ve abandoned Aristotelian thinking), and Father Koterski’s lectures provide a good foundation for why. If you have an Audible account, this is definitely worth the credit. The one thing I wish were different, though, would be to eliminate the applause from the lectures. That noise is unnecessary and detracts from the experience.

Updates for the Big Data and Visualization MCW

Last week, I promised some changes to the Big Data and Visualization Microsoft Cloud Workshop. The bulk of these changes are now in development and I wanted to cover what some of the changes look like prior to their release.

Goodbye, Azure ML

In the current architecture, you can see that we use both Databricks and Azure Machine Learning.

All of the machine learning options.

We train a model using Databricks, save it, and use a notebook to move the model to Azure ML where we can serve it using Azure Container Instances for web application consumption. This is great, except that the Azure Machine Learning team has removed AzureML-PySpark-MmlSpark-0.15 from their list of curated environments and building your own environment to host Spark is…not something I wanted to cram into this lab. Instead, we’re going to take advantage of MLflow model serving on Databricks, which is rather convenient, except that there doesn’t appear to be an SDK option to enable it. Still, this does let me simplify the architecture a bit:

Airbrushed like a Moscow pro.

WTF, .NET

The Databricks REST API is similar to what Azure ML provided, but not quite the same. For example, the prediction model we built doesn’t return a confidence score, so the segment in Exercise 8’s web app which displays confidence is gone. The big change to this app, however, is an oddity in the way you call the Databricks REST API.

First, you need to use a Personal Access Token (PAT) to make calls. That’s not a big deal; you simply add a bearer token:

client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", pat);

I could put this as a header on the HttpContent call, but 100% of calls need the token, so I figured I might as well just put it on the HttpClient. But the real sticky wicket came when I tried to call the Databricks API and got 500 errors because it wants the JSON in a specific format.

When building a Postman call, I successfully called the API with normal JSON syntax: { "Attribute":"Value", "Attribute2":"Value2" ... }.

Instead of this nice, easy-to-build JSON structure which works with Postman calls, something in the .NET call required switching to the Pandas split DataFrame format: { "columns":["Attribute", "Attribute2", ...], "data":[["Value", "Value2", ...]] }.

This is not trivial to build in .NET. If I were going to do this for several calls and a lot of code, I’d probably end up using reflection to loop through the list of field names and build a mapper to reshape the data as arrays. Since this is a demo, I hard-coded the attribute names (knowing that they haven’t changed in several years) and used a method to perform all of that mapping work. It’s not beautiful by any means, but it does serve the purpose.
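
To make that mapping concrete, here’s a minimal sketch of the approach, assuming System.Text.Json; the column names reuse the Attribute/Attribute2 placeholders from above, and modelUri stands in for whatever serving endpoint URL the workspace exposes, so treat this as illustrative rather than the lab’s actual code.

using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class DatabricksScoring
{
    // Reshape a flat set of name/value pairs into the Pandas "split" DataFrame format:
    // { "columns": ["Attribute", "Attribute2"], "data": [["Value", "Value2"]] }
    public static string ToPandasSplitJson(string[] columns, object[] values)
    {
        var payload = new
        {
            columns,
            data = new[] { values }
        };

        return JsonSerializer.Serialize(payload);
    }

    // modelUri is a placeholder for the model serving endpoint; pat is the Personal Access Token.
    public static async Task<string> ScoreAsync(HttpClient client, string modelUri, string pat, string[] columns, object[] values)
    {
        // Every call needs the PAT, so it goes on the HttpClient rather than on each HttpContent.
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", pat);

        var content = new StringContent(ToPandasSplitJson(columns, values), Encoding.UTF8, "application/json");
        var response = await client.PostAsync(modelUri, content);
        response.EnsureSuccessStatusCode();

        return await response.Content.ReadAsStringAsync();
    }
}

In the lab itself, the column names are hard-coded rather than passed in as parameters, but the shape of the payload is the same.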

One-Click Deployment, and Then More Clicks Later

The last big change to the lab is to add an ARM template to deploy the Databricks workspace, Azure Blob Storage account (and a container named sparkcontainer), and Data Factory. There’s still an optional step to create a VM which I left as step-by-step work rather than a deployment script because it’s optional—you can choose to deploy a VM or simply install the Integration Runtime on a Windows machine of your choice.

When Will This Be Available?

I am jumping the gun a little bit on announcing these changes, as they aren’t quite available yet. They, along with any other changes I make before the deadline, should be available sometime in November with this refresh. I’ll have an announcement when it is officially out, but now you have a sneak peek and can wait with bated breath. Or something.

Upcoming Events: Azure Community Conference

Key Details

What: Azure Community Conference
Where: On the Internet, UTC+5.5 (India Standard Time).
When: Friday, October 29th and Saturday, October 30th. Full-day, paid trainings on Sunday, October 31st.
Admission is free. RSVP on the Azure Community Conference website.

What I’m Presenting

7:30 – 8:30 AM EDT — Riding the Rails: Railway-Oriented Programming with F#

This is a talk I’ve wanted to put together for some time, and I’m glad I have the opportunity to give it here. This talk will, of course, be heavily indebted to Scott Wlaschin’s outstanding presentation and articles on the topic. I won’t stray too far from Scott’s explanations, though I do plan to simplify it a little and include some code examples to make things more apparent.

Upcoming Events: SQL Saturday Orlando

Key Details

What: SQL Saturday Orlando
Where: Orlando, Florida. Oh my goodness, it’s a real-life in-person event!
When: Saturday, October 30th.
Admission is free. RSVP on the SQL Saturday website.

What I’m Presenting

9:00 – 10:00 AM EDT — Saving Your Wallet from the Cloud

Not only is this an in-person event, it’s also a brand new talk from me. The talk will cover ways to save some scratch in the cloud by understanding how the pricing model works, estimating how much things cost, and developing cloud-specific utilization patterns to save money in the long run.

Avoiding NULL with Normalization

Aaron Bertrand sticks up for NULL:

A long time ago, I answered a question about NULL on Stack Exchange entitled, “Why shouldn’t we allow NULLs?” I have my share of pet peeves and passions, and the fear of NULLs is pretty high up on my list. A colleague recently said to me, after expressing a preference to force an empty string instead of allowing NULL:

“I don’t like dealing with nulls in code.”

I’m sorry, but that’s not a good reason. How the presentation layer deals with empty strings or NULLs shouldn’t be the driver for your table design and data model. And if you’re allowing a “lack of value” in some column, does it matter to you from a logical standpoint whether the “lack of value” is represented by a zero-length string or a NULL? Or worse, a token value like 0 or -1 for integers, or 1900-01-01 for dates?

I linked to this on Curated SQL, where I’d started to write out a response. After about four paragraphs of that, I decided that maybe it’d make sense to turn this into a full blog post rather than a mini-commentary, as I think it deserves the lengthier treatment. I’m going to assume that you’ve read Aaron’s post first, and it’s a well-done apologia in support of using NULLs pragmatically. I’ll start my response with a point of agreement, but then move to differences and alternatives before laying out where I see additional common ground between Aaron’s and my thoughts on the matter.

Token Values (Generally) Aren’t the Answer

Token values, like the ones Aaron describes, aren’t a good answer. And there’s a really good reason why: values represent something specific. An admission date of 2021-07-01 represents a specific fact: some person was admitted on 2021-07-01, and “2021-07-01” itself represents July 1, 2021. I’m belaboring this point because token values do not follow this same pattern. An admission date of 1900-01-01 might follow this same pattern, but it might not. The problem is that there is an information sub-channel, where I need to know if the value represents the thing in itself, or if it in fact represents some other, unrelated notion. Quite frequently, a date of 1900-01-01 (or 1763-01-01 or 0001-01-01) means “I don’t know what the actual date was, but it was long enough ago that it shouldn’t really matter…I think.”

This is a problem for all of the reasons Aaron describes: there is typically nothing documenting the fact that an information sub-channel exists, no clear representation of what each datum represents (is 1900-01-02 different from 1900-01-01 here?), and there is the risk of problems down the line if people don’t know that special cases exist.

One area where token values can be the answer is data warehousing, where we might use well-known sentinel values (such as 1900-01-01 or 9999-12-31) to avoid NULLs in warehouse queries and to handle type-2 slowly changing dimensions. But in most places, this is a workaround rather than a solution, and not a good one.

Embrace Normalization

The proper answer here is normalization, specifically 6th Normal Form. Given that I need two hands to count up to 6th Normal Form, that’s a pretty big number. Fortunately, I happen to have a talk which goes into detail on database normalization, so I can swipe a slide from it:

Well, if that doesn’t clear everything up, I don’t know what will!

Okay, so what does this mean? Fortunately, I have an example on the next page. Suppose that we have some table (okay, relvar, but let’s not spend too much time on semantics here) with a few attributes: Product = { ProductKey, ProductName, ProductColor, ProductWeight, StorageCity }. Our natural key is ProductKey, and each attribute describes a fact about the specific product. Putting this into 6th Normal Form would mean creating four separate tables/relvars:

ProductKeyProductName = { ProductKey, ProductName }

ProductKeyProductColor = { ProductKey, ProductColor }

ProductKeyProductWeight = { ProductKey, ProductWeight }

ProductKeyStorageCity = { ProductKey, StorageCity }

Now, before you run me out of town, let’s stipulate that I don’t recommend doing this, just that if you wanted to be in 6th Normal Form, this is what you’d have to do.

What I do recommend, however, is that we use 6NF to avoid NULLs. Suppose that we might not always have a product weight, but that we always have the other attributes. Instead of making ProductWeight NULLable, we can instead create two tables:

Product = { ProductKey, ProductName, ProductColor, StorageCity }

ProductWeight = { ProductKey, ProductWeight }

That way, if we do have a product weight, we insert a row into the table. If we don’t have a product weight, we don’t insert a row.

Using Aaron’s example, where EnteredService is not always known, we end up with two tables:

CREATE TABLE dbo.Widgets
(
    WidgetID     int IDENTITY(1,1) NOT NULL,
    SerialNumber uniqueidentifier NOT NULL DEFAULT NEWID(),
    Description  nvarchar(500),
    CONSTRAINT   PK_W6NF PRIMARY KEY (WidgetID)
);

CREATE TABLE dbo.WidgetEnteredService
(
    WidgetID int NOT NULL,
    EnteredService datetime,
    CONSTRAINT PK_WES PRIMARY KEY (WidgetID)
);

When you have a relatively small number of NULLable columns, this is a great solution.

How Does It Perform?

I think the above solution is better from a relational theory standpoint, but let’s do a quick performance comparison. As a note, my priors on this are that this will not always be great for performance, especially if you have several 6NF tables connected to your base table. Furthermore, because I’m not doing a comprehensive test of all possible scenarios (or even a reasonable subset of all interesting scenarios), if it does perform better, I’m not going to claim that the 6NF solution is better because of performance. Instead, my argument is based on relational architecture over performance. But if it runs like a dog, that’s a strong argument in favor of Aaron’s more pragmatic approach. So let’s give it a go and find out, shall we?

Here’s the setup script I used, adding in my tables and keeping things as similar to his setup as I could.

CREATE TABLE dbo.Widgets
(
    WidgetID     int IDENTITY(1,1) NOT NULL,
    SerialNumber uniqueidentifier NOT NULL DEFAULT NEWID(),
    Description  nvarchar(500),
    CONSTRAINT   PK_W6NF PRIMARY KEY (WidgetID)
);

CREATE TABLE dbo.WidgetEnteredService
(
    WidgetID int NOT NULL,
    EnteredService datetime,
    CONSTRAINT PK_WES PRIMARY KEY (WidgetID)
);

CREATE TABLE dbo.Widgets_NULL
(
    WidgetID     int IDENTITY(1,1) NOT NULL,
    SerialNumber uniqueidentifier NOT NULL DEFAULT NEWID(),
    Description  nvarchar(500),
    CONSTRAINT   PK_WNULL PRIMARY KEY (WidgetID)
);
 
CREATE TABLE dbo.Widgets_Token
(
    WidgetID     int IDENTITY(1,1) NOT NULL,
    SerialNumber uniqueidentifier NOT NULL DEFAULT NEWID(),
    Description  nvarchar(500),
    CONSTRAINT   PK_WToken PRIMARY KEY (WidgetID)
);

INSERT dbo.Widgets_NULL(Description) 
OUTPUT inserted.Description INTO dbo.Widgets_Token(Description)
SELECT TOP (100000) LEFT(OBJECT_DEFINITION(o.object_id), 250)
FROM master.sys.all_objects AS o 
    CROSS JOIN (SELECT TOP (50) * FROM master.sys.all_objects) AS o2
WHERE o.[type] IN (N'P',N'FN',N'V')
	AND OBJECT_DEFINITION(o.object_id) IS NOT NULL;

INSERT INTO dbo.Widgets(Description) SELECT Description FROM dbo.Widgets_NULL;

ALTER TABLE dbo.Widgets_NULL  ADD EnteredService datetime;
ALTER TABLE dbo.Widgets_Token ADD EnteredService datetime;
GO
 
UPDATE dbo.Widgets_NULL  
SET EnteredService = DATEADD(DAY, WidgetID/250, '20200101') 
WHERE WidgetID > 90000;
 
UPDATE dbo.Widgets_Token 
SET EnteredService = DATEADD(DAY, WidgetID/250, '20200101') 
WHERE WidgetID > 90000;
 
UPDATE dbo.Widgets_Token 
SET EnteredService = '19000101'
WHERE WidgetID <= 90000;

INSERT INTO dbo.WidgetEnteredService(WidgetID, EnteredService)
SELECT
	WidgetID,
	EnteredService
FROM dbo.Widgets_NULL
WHERE
	EnteredService IS NOT NULL;

CREATE INDEX IX_EnteredService ON dbo.Widgets_NULL (EnteredService);
CREATE INDEX IX_EnteredService ON dbo.Widgets_Token(EnteredService);
CREATE INDEX IX_EnteredService ON dbo.WidgetEnteredService(EnteredService);

Here are the size totals for each of the three states: initial table loading, after adding EnteredService dates, and after adding an index on EnteredService.

Note that my values completely agree with Aaron’s for the first two sets, though my “After index” sizes are ever so slightly larger than his. That difference isn’t enough to matter for the analysis.

Running Aaron’s sample query for all three tables, we can see that the 6NF form is just about the same as the NULL form:

In order, Token, 6NF, NULL.

If we change the query a bit to include serial number and not aggregate, we end up with queries like these:

SELECT WidgetID, SerialNumber, EnteredService
FROM dbo.Widgets_NULL 
WHERE EnteredService <= '20210101';

SELECT wes.WidgetID, w.SerialNumber, wes.EnteredService
FROM dbo.WidgetEnteredService wes
	INNER JOIN dbo.Widgets w
		ON wes.WidgetID = w.WidgetID
WHERE wes.EnteredService <= '20210101';

In this case, both have durations of 3ms on my machine and both take 5607 reads (well, the NULL version takes 5608, but close enough). That’s because both of them ultimately have very similar-looking query plans:

Always with the nested loops.

In fairness, the NULLable version of this has a benefit that we won’t get from the 6NF version: if we know we’ll include SerialNumber in queries, we can add it to the index and remove the key lookup, making this an 11-read operation versus 5607 reads. This means that yes, you should expect a degradation in performance from 6NF if you filter on the would-be-NULLable columns and can create covering indexes on the single table.
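
For reference, here’s a sketch of what that covering index could look like, rebuilding the IX_EnteredService index from the setup script above with SerialNumber as an included column:

CREATE INDEX IX_EnteredService
    ON dbo.Widgets_NULL (EnteredService)
    INCLUDE (SerialNumber)
    WITH (DROP_EXISTING = ON);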

For other types of queries, the answer is a bit murkier. For example, pulling rows over a range filtered by WidgetID will give you differing results based on the size of the range: for smaller ranges, the 6NF form may end up with fewer reads; for larger ranges, the NULLable version may end up with fewer reads. For point lookups (WidgetID = x), it takes 5 reads for 6NF and 3 for NULL based on these row counts. Here are the range queries for reference:

SELECT WidgetID, SerialNumber, EnteredService
FROM dbo.Widgets_NULL 
WHERE WidgetID > 95000;

SELECT wes.WidgetID, w.SerialNumber, wes.EnteredService
FROM dbo.Widgets w
	LEFT OUTER JOIN dbo.WidgetEnteredService wes
		ON wes.WidgetID = w.WidgetID
WHERE w.WidgetID > 95000;

In short, it’s a mixed bag but I’m willing to say that 6NF will probably perform worse, especially on very large datasets.

When to Compromise?

In this last section, I want to talk about cases where I do agree with Aaron and have no real problem using NULLable columns. The relational model is a beautiful thing, but implementations aren’t perfect and we do need to keep performance in mind, so “No NULLs ever” isn’t a stance I can support. Also, note that I came up with this list late at night and I’m probably missing some cases. So think of it as illustrative rather than comprehensive. Sort of like the whole post, now that I think about it…

Performance-Critical Tables with Frequent Use of NULL Columns

The first example fits in with what we saw above. If you constantly query the Widgets table by EnteredService and you need the absolute best performance, then yes, making this one single table can make sense. But I want to emphasize that this applies to performance-critical tables, and frankly, I’d lay out the claim that query performance is rarely a critical need—there may be a few tables in an environment which absolutely need supporting queries to run as fast as possible, but for most other tables and queries, “good enough” is a fair distance from “absolute fastest possible.” Mind you, I’m not talking about completely ignoring performance or being okay with OLTP queries taking 5 minutes; instead, we’re talking about 3ms with 5000 reads vs 3ms with 11 reads. If you run this query a few million times per day, then yeah, use the NULL. If you run the query a few hundred times per day (or run other variants in which the difference is smaller or even in 6NF’s favor), I don’t think that’s a real deciding factor.

Staging Tables

Staging tables are a good example of a case where I’d happily use NULL, and that’s because I have no idea what kind of garbage I’m going to get from source systems. I need to develop ETL processes with garbage inputs in mind, and that means dealing with missing values or bad data that I might need to drop. Also, I might merge together data from multiple sources in separate chunks, loading 4 columns from a dimension based on this source, 3 from that source, and 5 more from yet another source. I don’t want to try to connect all of these together (perhaps for performance reasons), and although I could solve the problem without NULL, it’s pretty easy when you let the staging tables be NULLable. But when going from staging tables to real tables, I have much more control over the domain and this reason goes away.

Temp Tables

As with staging tables, I don’t typically care about NULL values in temp tables. They’re not permanent tables, and they’re typically intended to help work on a problem piecemeal. NULL away in these temp tables unless you need the benefits of NOT NULL (or need to add a primary key or something else which requires non-NULL attributes).

Conclusion

Do you need NULL in your database? Nope. Using 6th Normal Form can eliminate the need for NULLs, but understand the implications of it.

Am I going to be insulted that you have NULL in your database? Nah.

Do I create tables with NULL? You betcha.

Should I have created all of those tables with NULL? Um…let’s not answer that one.

What Makes a Good Visualization?

This post has been on my backlog for about a year, but I’m finally getting to it. That’s the power of backlogs!

Xenographics

The Xenographics website provides a number of creative ways of visualizing data. Some of them are great. Others are…not. So I decided to review a few and try to explain my reasoning along the way. I definitely won’t review every one, but let’s look at some of these.

Bracket Probabilities

Let’s start with one I like: bracket probabilities.

What I like about this is that it’s a really good “newspaper” graphic. In other words, this is a visual which works well in explaining probabilities to the general public. It shows the probability of victory for each team across each game, includes a few notes on key events, and conveys a lot of information, though in a fairly large amount of space.

Manhattan Plots

Next up is the Manhattan Plot:

This is another plot I like, though it’s more of a “business” graphic than a “newspaper” graphic. This is sort of like a histogram with a large number of categories.

Multi-Class Hexbins

Now we’re going to look at one I’m not as fond of: multi-class hexbins.

Because it’s something I don’t like, let me dive a bit into why. The idea here is to show differences over an area, such as voting for one of two political candidates by district. It does so with a series of hexagonal pie charts. Pie charts are typically not a very good type of visual for a few reasons:

  • They take up a good amount of space
  • Humans have a difficult time comparing angles, especially angles with small differences
  • Once you get past 2 or 3 elements, you greatly increase the likelihood that you’ll end up with tiny slivers of pie chart which become almost impossible to read, even with a legend

Whenever you want to use a pie chart, there is usually a better visual available. In this case, I’d prefer a heatmap with gradients from orange to purple. It’d be hard to tell exact numbers with a heatmap as well, but the hexbin pies give you a feeling of false precision that a heatmap doesn’t.

A Few More I Like

I’m a big fan of UpSet plots, which provide a lot of information in a relatively compact space.

The problem with UpSet plots is that they take some getting used to. Laura Ellis has a great post on how UpSet plots work, but the short version is that the bar chart on the left indicates the number of Twitter followers per person. The dot-and-line plot at the bottom indicates combinations and the column chart at the top tells how many people fall into each category. For example, 634 people follow all of the Twitter accounts, and 10,864 follow @dataandme but none of the others.

There are a couple of dot-style plots that I like as well: the Raincloud plot and the Dot-boxplot. These provide you the relevant information in a dot plot (including jitter, which is really important as your dataset gets dense) along with a broader signifier of density.

Defining Important Characteristics

In an episode of Shop Talk a year ago, I talked in detail about six characteristics that I think are important when choosing a visual. They are as follows:

  • Intuitive — A visual should be easy for a person to understand despite not having much context. In some cases, you have the opportunity to provide additional context, be it in person or in a magazine. That lets you increase the complexity a bit, but some visuals are really difficult to understand, and if you don’t have the luxury of providing additional context, you make your viewer’s job harder.
  • Compact — Given two visuals, the one which can put more information into a given space without losing fidelity or intuitiveness is preferable. This lets you save more screen real estate for additional visuals and text. There are certainly limits to this philosophy, so consider it a precept with diminishing marginal returns.
  • Concise — Remove details other than what helps tell the story. This fits in with compactness: if you have unnecessary visual elements, removing them lets you reclaim that space without losing any fidelity. Also, remove unnecessary coloration, changes in line thickness, and other things which don’t contribute to understanding the story. Please note that this doesn’t mean removing all color—just coloration which doesn’t make it easier for a person to understand what’s happening.
  • Consistent — By consistency, what I mean is that the meaning of elements on the visual does not change within a run or between runs. Granted, this is more relevant to dashboards than individual visuals, but think about a Reporting Services report which uses default colors for lines on a chart. If you refresh the page and the colors for different indicators change, it’s hard for a person to build that mental link to understand what’s happening.
  • Glanceable — Concise and consistent visuals tend to be more glanceable than their alternatives. Glanceable means that you are able to pick out some key information without needing to stare at the visual. Ideally, a quick glance at a visual tells you enough of what you need to know, especially if you have seen the same visual in prior states.
  • Informative — This last consideration is critical but often goes overlooked. The data needs to be useful and pertinent to users, describing the situation at the appropriate grain: it includes all of the necessary detail for understanding while eschewing unnecessary detail.

If you want to see that episode, here it is below:

Conclusion and Hand-Waving

As I wrap up this post, I do want to mention that context means a lot. If you’re dealing with an audience which is intimately familiar with UpSet plots, they might not think the plot difficult to understand at all. But if that’s the first time you’ve seen the plot in your life, it’s not going to be easy to figure out. If I’m in a situation in which I can provide additional information, either by explaining it in person or adding informative notes in a presentation deck, then I don’t have a problem moving ahead with it. But if I won’t get that opportunity to explain the chart in detail, I think I might try to pick something simpler to ensure that my audience will get it. There’s real value in some of the complex charts, but the most important thing is to remember your audience and have a good understanding of their capabilities and experience.

The Value of Microsoft Cloud Workshops

Microsoft Cloud Workshops are free, step-by-step guides for implementing solutions in Azure. As part of the work I do with Solliance, I’ve assisted in creating and maintaining several MCWs in collaboration with Microsoft.

Two MCWs in particular that I’ve been working on updating right now are Innovate and modernize apps with Data and AI and Big data and visualization. Both of these cover a variety of Azure technologies and show you how to combine services to solve an overriding business problem.

Innovate and Modernize Apps with Data and AI

Innovate and modernize apps with Data and AI is the most complex MCW I’ve worked on. It combines database development using PostgreSQL and Cosmos DB, machine learning with Azure ML, integration between Azure Synapse Analytics and Cosmos DB via Azure Synapse Link, event handling with IoT Hub and Event Hub, containerized microservices following the CQRS pattern, and plenty of Azure Functions to glue it all together.

This is a lot of services. Also, the image is going to be updated to simplify things.

In this next round of updates, we’re going to re-work the process and remove PostgreSQL from the solution. We’ll also simplify the data loading process, using Event Hubs instead of going through the deployment process for an IoT Hub. That’s a bit of a shame, as I really like the IoT Hub process we created. But them’s the breaks.

Big Data and Visualization

Big data and visualization is an MCW I’m currently updating. It’s built primarily off of Azure Databricks and Power BI, but incorporates several other services including Azure Data Factory and Azure SQL Database. One of the changes I’m going to make will actually simplify the process by removing Azure Machine Learning from the lab, instead taking advantage of Databricks experiments and its REST API for ML models.

This MCW drives off of Azure Databricks and Power BI, with several supporting services.

More Information and a Request

If you want to see the full set of Microsoft Cloud Workshops, be sure to check out these free, advanced resources at http://aka.solliance.net/mcw.

Also, if you’ve ever used an MCW before, please provide some feedback with the following survey: https://forms.office.com/r/834zwtaNtK. This survey will help drive the future of the MCW program, so we’d really appreciate you filling it out.

First Experiences with the Framework Laptop

I’ve had the Framework Laptop for a couple of days, so I wanted to put together a few thoughts on it. I’ll start with some build images, move into experiences running Linux on the laptop, and wrap up with some general thoughts.

Unboxing and the Build Process

The box arrived on Thursday:

One laptop box with a judiciously-covered receipt.

The power adapter does have a square brick shape, but I’ve been pleased with the lengths of the USB-C cable and the AC adapter cable. I haven’t had to give them a real test yet, but the cable length is definitely longer than what I currently have with my Yoga, and the big brick is in the middle, not at the end. This has the salutary effect of not taking up several spots in a surge protector.

I ordered the DIY edition and it arrived with the WiFi card, RAM, NVMe disk, and expansion cards.

Everything all lined up and ready for installation.

Based on this image, it’s pretty obvious that I’m going to have to install these myself. By the way, one thing that was a little surprising was just how small and light the expansion cards are.

The laptop itself is about the same size as my existing Lenovo Yoga—roughly the same dimensions and weight. It’s close enough that I’m already used to it. This laptop also comes with the only tool you’ll need for any repair operation: a combination spudger and T5/Phillips head screwdriver.

Spudge away.

The bottom of the laptop shows the four expansion card slots.

The back of the laptop.

Zooming in on the expansion card slots, they’re actually USB-C slots. I have 5 cards in total: two USB-C, two USB-A, and one HDMI. These are hot-swappable, so you can pop one of these cards out and put in a different one on the fly.

Here’s an example of one of the cards:

From USB-C to USB-C. That’s how I roll.

Putting together the laptop was straightforward on the whole. The interior is clean and there are QR codes for each major element, leading you to a repair and installation guide.

The only thing I had any trouble with was the WiFi card installation, and that’s because I’d never done it before. Installing the card itself was easy; the trick was connecting the two antenna cables to the adapter and making sure to hear a “click” as each snapped in.

Working with Linux

Everything worked well on Linux except the WiFi card: elementary OS 6 is built off of Ubuntu 20.04 LTS, which is great except that it comes with a version of the Linux kernel which has a regression affecting the WiFi adapter I’m using. In fairness to elementary OS, my Windows 10 boot USB also didn’t include the appropriate driver. To fix this on Ubuntu or elementary OS, check out these instructions. Download the appropriate driver, delete the pnvm file, and go on with life. I have also seen some issues where the pnvm file gets re-created after installing updates, so you might want to keep those instructions handy.

Also, I’d recommend adding the following to your .profile file to support 3000×2000 resolution, particularly if you’re using elementary OS:

# Elementary OS sizing
# https://community.frame.work/t/using-elementary-os-on-the-framework-laptop/4453
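# Define a 3000x2000 @ 60Hz mode and make it available on the internal (eDP-1) display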
xrandr --newmode "3000x2000_60.00"  513.44  3000 3240 3568 4136  2000 2001 2004 2069  -HSync +Vsync
xrandr --addmode eDP-1 "3000x2000_60.00"

I haven’t tried playing any graphics-intensive games, though I’m not expecting any great shakes given the integrated chipset.

General Thoughts

The build quality on this laptop so far is nice. Everything feels well put together, and the keyboard is pretty good (though I’m used to mechanical keyboards, so it’s always going to be a step down). I’m pretty happy with the overall quality of everything, but again, I haven’t had enough time with it to find any big problems.

One thing of note is that the USB slots are really tight, both the USB-A and USB-C. Once you plug in a cable, it’s not going anywhere.

The camera is a pretty decent laptop webcam. It won’t beat a 4K external camera, but it serves its purpose. More importantly, it’s located at the top of the screen and not the bottom, so you don’t get the “finger cam” of many modern laptops. Most importantly, both it and the microphone have dedicated hardware kill switches. This is way better than a software kill switch, as it means the camera or microphone doesn’t even receive power and physically cannot turn on when the switch is in the off position.

Kill switches for the microphone (left) and camera (right).

Speaking of the microphone, it’s okay but no great shakes. About what you’d expect from a laptop microphone. Fortunately, plugging in a Jabra works just fine in Linux, so for any real conversations, I’ll switch to it or a proper dedicated microphone.

Battery life is acceptable. When I’m pushing the laptop, like when I was setting up OBS Studio, I get about 2 hours on Linux. When writing blog posts or doing other non-intensive work, battery life is more like 3 hours, maybe a little more. It’s definitely not an all-day battery, but good enough when I’m working between places. For longer trips, I bought a separate battery pack which would suffice for the better part of any flight…assuming long flights ever happen again…

Outliers, Anomalies, and Noise

This is part of the series Finding Ghosts in Your Data.

For the inaugural post in this series, I want to spend a few moments on terminology. What’s the difference between an “outlier” and an “anomaly”? The interesting thing is that there’s no clear delineation in the literature. As a quick example, Mehrotra, et al, define outliers and anomalies as the same thing: “substantial variations from the norm” (4), although they do point out in a footnote that you can think of anomalies in processes and outliers in data. By contrast, Aggarwal has a definition of outliers and anomalies that I prefer (1-4). Here is a quick summation of the difference.

Define Your Terms

An outlier is something significantly different from the norm. For example, in the following image, we can see our data follow a consistent distribution everywhere except for something on the edge of the graph. Those data points are outliers.

Data which follows a sort-of-normal distribution, except for the data which doesn’t.

Within the set of outliers, we can differentiate two classes: anomalies and noise. Anomalies are outliers which are interesting to us, and outliers which are not interesting to us are noise.

Isn’t That Interesting?

This leads to an important question: how do we determine if something is interesting or not? There’s a lot to this question and I won’t be able to answer it conclusively here—in fact, it’s an answer which is going to take the majority of the book to sort out—but let’s lay out a few pointers and take it from there.

Different Models

Our first point is actually another definition of anomalies: anomalies can be defined as data which appears to be generated from a different model than the rest of the data. Looking at the image above, we can easily imagine that the bulk of our data comes from some operation which follows a normal distribution. But that other data is approximately 8 standard deviations from the mean. To give you an idea of how likely it is that we’d get a point that far out, there’s a 1 in 390,682,215,445 chance that we’d get something at least 7 standard deviations from the mean. Having multiple data points 7 or 8 standard deviations from the mean is so improbable that, in reality, those data points probably didn’t come from the same process which generated the rest, and the fact that they come from a different process is interesting, therefore making those outliers anomalous.
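
For the curious, that figure comes from the two-sided tail probability of a normal distribution at seven standard deviations:

$$P(|X - \mu| \geq 7\sigma) = 2\left(1 - \Phi(7)\right) \approx 2.56 \times 10^{-12} \approx \frac{1}{390{,}682{,}215{,}445}$$

At eight standard deviations, the two-sided probability drops to roughly 1.2 × 10⁻¹⁵, which is even harder to wave away.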

Distance from Normality

I also snuck in a second conception of what makes an outlier anomalous: distance from normality. The idea here is that the further away a thing is from the norm, the more likely it is to be an outlier. That’s because our processes are probabilistic. If I say that a golfer can sink this putt from 8 feet, what I mean is that the golfer will hit the ball and it will end up in the hole with some fairly high probability, but not always. The ball might end up a few inches away from the hole with some probability, or it could end up a few feet from the hole with some smaller probability. Therefore, if I pluck one example, drop a golfer in to make an 8′ putt, and the ball ends up 5″ away from the hole, missing the putt is an outlier event, but it’s still within the realm of expectations. From our standpoint of analysis, this is noise. If the ball ends up 50′ away, that’s a totally different story.