Back to Blogging

It’s been a hot minute since I’ve regularly blogged, but I’m working on getting back into it. Let’s see a few of the things I’ve been up to lately, where “lately” is, oh, most of this year…

Trainings

I’ve launched a couple of trainings on Teachable, one on the APPLY operator and one entitled the Curated Data Platform. The goal of these trainings is to provide in-depth coverage of focused topics at reasonable prices. For now, I’m probably not going to develop any more of these trainings, as they take a lot of time to put together and the ROI isn’t there today.

Another Book

Speaking of “the ROI isn’t there,” I’ve agreed to write a book on anomaly detection in Python. The working title of the book is Finding Ghosts in Your Data: Anomaly Detection Techniques with Examples in Python. I definitely was not anticipating writing a second book after PolyBase Revealed, but I have spent a lot of time in the world of anomaly detection the last few years, and it’s an area where I think I can make a contribution. Most books on anomaly detection tend to have the feel of textbooks: heavy on the statistical and mathematical underpinnings of techniques, but light on implementation. The goal of Finding Ghosts in Your Data is to straddle the line between academic work and tutorial. I’ll still get into a lot of detail on anomaly detection techniques, but the intended audience for this book is a software developer who has forgotten most of his statistics course from university days.

I also intend to do a fair amount of blogging on the book as I write it. I won’t give away the whole thing, but I will share a lot along the way.

New Talks! Some of Them In Person!

Right now, I’m in the midst of developing four new talks, all of which have to be done before the end of the month.

Keeping It Classy: Designing a Great Classifier and Building Your First Data Pipeline in Apache Spark are going to debut at the PASS Data Community Summit as part of two separate learning paths. The first talk provides a solid foundation in classification: what a classification algorithm is in the data science world, the different types of classification algorithms, and when you might choose one over another. I cover a variety of tree-based (e.g., CART, random forest, XGBoost) and non-tree (e.g., kNN, Naive Bayes, Passive-Aggressive) algorithms, explain at a high level how they work, and show how you can work with them using libraries like scikit-learn.
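
By way of illustration—this is a sketch I’m including here, not a demo from the talk—here’s roughly what working with two of those classifiers looks like in scikit-learn. The dataset and hyperparameters are arbitrary stand-ins:

# Train two of the classifier families mentioned above on the same dataset
# and compare test-set accuracy. Dataset and settings are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for clf in [RandomForestClassifier(n_estimators=100), KNeighborsClassifier(n_neighbors=5)]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))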

The second talk, meanwhile, provides an introduction to Apache Spark by way of Azure Databricks. In it, I’ll cover the basic details of what Apache Spark is, how Databricks fits into it all, and how we can create data pipelines. Trust me when I say that I stretch the pipeline metaphor as far as it goes, and maybe a little further.
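
To make “data pipeline” a bit more concrete, here’s a minimal PySpark sketch of a single pipeline stage—read raw data, transform it, write a curated output. This isn’t the talk’s demo; the paths and column names are hypothetical:

# One pipeline stage: raw CSV in, aggregated Parquet out.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("FirstPipeline").getOrCreate()

orders = spark.read.option("header", True).csv("/mnt/raw/orders.csv")
daily = (orders
    .withColumn("order_total", F.col("quantity").cast("double") * F.col("unit_price").cast("double"))
    .groupBy("order_date")
    .agg(F.sum("order_total").alias("revenue")))
daily.write.mode("overwrite").parquet("/mnt/curated/daily_revenue")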

Riding the Rails: Railway-Oriented Programming with F# is the third talk I’m currently working on. It follows Scott Wlaschin’s excellent Railway-Oriented Programming metaphor and talk, and I plan to give it my own spin by including more code in the talk itself. The cost of focusing more on the code is losing some of the depth of discussion that Scott hits, but I hope that trade-off is worthwhile, as I really like the ROP metaphor / Either monad. You can find this talk at the Azure Community Conference.
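
The talk’s code is in F#, but to keep the code on this page in one language, here’s a hypothetical Python rendition of the two-track idea: each step returns an Ok or Error value, and bind only runs the next step while you’re on the success track:

# Railway-Oriented Programming sketch: steps stay on the success track
# until one fails, after which the error short-circuits to the end.
def bind(result, fn):
    tag, value = result
    return fn(value) if tag == "ok" else result

def parse_age(s):
    return ("ok", int(s)) if s.isdigit() else ("error", f"'{s}' is not a number")

def validate_age(age):
    return ("ok", age) if 0 < age < 130 else ("error", f"{age} is out of range")

print(bind(parse_age("42"), validate_age))     # ('ok', 42)
print(bind(parse_age("forty"), validate_age))  # ('error', "'forty' is not a number")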

Finally, the fourth talk debuting this month is entitled Saving your Wallet from the Cloud, and it is intended to help you understand how pricing in the cloud works and the different methods you can use to slice that bill. This one will debut at SQL Saturday #1021 in Orlando.

I’ll probably have a blog series for each talk over the next couple of months, once the time constraints have softened a bit.

Microsoft Cloud Workshops

One of the things I do for Solliance is create and maintain Microsoft Cloud Workshops. Right now, I have two on my plate: Big data and visualization, and Innovate and modernize apps with Data and AI. Both of them have updates scheduled, and they’re both pretty big ones.

DataCamp Courses

A few months ago, I was the subject matter expert for DataCamp’s Data Modeling in Power BI course. You won’t see or hear me there, but I shaped the course design, developed most of the content, and handed all of that off to DataCamp folks so that I don’t have to think about it any longer…

I’m currently doing the same on a course around data visualization in Power BI. This course should be particularly interesting because I’m combining psychological concepts (knowing your audience, getting an emotional response, reducing cognitive load, tracking focal points, etc.) with a grand overview of most Power BI visuals, including custom visuals and Python/R visuals. In addition to those, there’s an entire lesson on designing for accessibility. For approximately 4-6 learner hours of training, there’s a lot of content packed in there. I’m about 2/3 of the way through this course, so we’ll probably see it release in December.

More on the Plate

There’s a bit more that I’m working on as well, but by this point, I’m now convinced I live on a planet with 36-hour days…or I’m over-booked, one of the two.

The Framework Laptop

About a month ago, I pre-ordered the Framework Laptop, specifically the DIY edition. It isn’t exactly everything I want in a laptop, but it does have a lot going for it, so I figured I would write up a summary of why I went with this one, especially as it’s expected to arrive tomorrow.

The Bottom Line: Right to Repair

Right to repair isn’t a topic I’ve discussed much, well, ever, but it’s an important issue for reasons I covered in the most recent episode of Shop Talk. I also plan to have a lengthier write-up sometime soon which covers my thoughts on the topic, but the really short version is that I want to maximize options for choice when it comes to property I own, and right to repair extends the sphere of possible choices.

The Framework team has put a lot of effort into making their laptop repairable and upgradable. They have a series of support guides which walk you, step by step, through the process of component replacement. For example, if you need to replace your mainboard, there’s a guide for that.

They’re also making schematics available to third-party repair shops, a rarity in the computing world. The sad part is that, several decades ago, schematics tended to be included in the giant product manuals for hardware; now, we consider it laudable that a company doesn’t consider this top-secret information.

The Top Line: Good Specs at an OK Price

When making my decision, I knew that I could find a similarly-spec’d laptop at a lower price, but that bottom line is worth a fair amount to me. In case you’re curious about the specs I chose, here goes:

  • Intel i7-1165G7. They offer an option with the i7-1185G7, but I don’t think the tiny performance difference is worth the big price difference.
  • 2 TB Western Digital SN750 NVMe SSD
  • 64 GB DDR4-3200 RAM. To date, my laptops have all had 16 GB of RAM, which works fine on its own, but once I feel the urge to spin up a Kubernetes pod or a few Docker containers, that RAM disappears fast.
  • HDMI expansion card, 2 USB-C expansion cards, 2 USB-A expansion cards. One of the cleverest choices the Framework team made was the modular design for their expansion cards. The laptop has four expansion card bays, and because the expansion cards are really just USB-C ports, they’re hot-swappable. There’s also the possibility of additional expansion card types fitting this common form factor in the future, extending the lifespan of this laptop.

The Middle Line: It’s Not Perfect

When purchasing any laptop, you’re going to make a series of trade-offs. You can get desktop computer power and a full-sized keyboard with number pad if you’re willing to haul around a 17″ monstrosity which weighs more than a newborn and acts like a space heater. If you want something extremely light (like I prefer), you’re typically going to settle for a mediocre keyboard, 12-13″ of screen space, and a limited amount of RAM.

All of this is to say that the Framework Laptop is a compromise option, especially considering that this is a new startup hardware vendor, so they’re only going to have a couple options available. Yeah, it’d be nice to have an AMD chipset, a touchscreen, a monitor which supports higher resolutions, and a keyboard with Home, End, Page Up, and Page Down as separate keys. But going back to the bottom line: unlike with other laptops, there’s actually a chance that I can resolve each of these over time. The Framework Marketplace has launched, and although it only has the DIY options today, there’s the possibility of new monitor varieties, swappable keyboards, and more in the future. Sure, some of this may be a “2 years later” scenario, but that’s still considerably better than I could ever hope for with any other laptop.

The Penultimate Line: Linux on the Laptop

One other factor I haven’t mentioned is that I’ve been itching to put Linux on my primary laptop for a while. I’ve avoided doing so mostly because of Linux’s atrocious support for touchscreens and my love of the same. But since the Framework Laptop doesn’t offer a touchscreen anyway, I decided to take another dive. In case you’re curious, I’m planning on using elementary OS, a distro built off of Ubuntu that I’ve used in the past and enjoyed. Yeah, I could use Ubuntu 21.10 (scheduled to release the same day I get my laptop, so what could possibly go wrong?), but one of the co-founders of elementary OS is a Framework Laptop user and has a great post covering setup, and I enjoyed elementary the last time I had it running on a laptop.

The Conclusion

No computer is going to be perfect, and laptops are particularly trade-off heavy. That said, I was happy enough with the options available with the Framework Laptop—and impressed enough with their stance in favor of right to repair—that I decided to take the plunge. I’d like to see more companies move toward making schematics available and making repair options easy, so if all other things are close enough to equal, I’ll go with the repair-friendly company over the repair-resistant company.

New Training: the Curated Data Platform (Free for a Limited Time)

I am pleased to announce a new course: The Curated Data Platform.

Given two hours of footage, you’d think I would find a spot where I smiled.

A Brief Summary

The Curated Data Platform is a 2-hour video training aimed at providing you with a 30,000-foot overview of the data platform space. In this course, I take you through a variety of data platform technologies—such as relational databases, document databases, caching technologies, data lakes, and graph databases. I show you use cases in which these technologies can be great fits, as well as which companies and products are most relevant in each space today. This includes on-premises technologies as well as major services in Amazon Web Services and Azure.

Pour one out for Riak here.

Get This Course for Free! (Limited Time Offer)

Through Sunday, July 25, 2021, you can register for this course for free using the coupon code FIRSTMOVER when you check out. My one request: if you use the coupon code, please be sure to leave feedback on the course—things you liked, as well as things you wanted to see but didn’t. I intend to update this course over time to make it better, based in part on learner feedback.

Winning at Pong via Reinforcement Learning

I finally got around to trying out a reinforcement learning exercise this weekend in an attempt to learn about the technique. One of the most interesting blog posts I’ve read on the topic is Andrej Karpathy’s post on using reinforcement learning to play Pong on the Atari 2600. In it, Andrej uses the Gym package in Python to play the game.

This won’t be a post diving into the details of how reinforcement learning works; Andrej does that far better than I possibly could, so read the post. Instead, the purpose of this post is to provide a minor update to Andrej’s code to switch it from Python 2 to Python 3. In doing this, I went with the most convenient answer over a potentially better solution (e.g., switching xrange() to range() rather than re-working the code), but it does work. I also bumped up the learning rate a bit to speed up training.

The code is available as a GitHub Gist, which I’ve reproduced below.

import numpy as np
import pickle
import gym

# hyperparameters
H = 200 # number of hidden layer neurons
batch_size = 10 # after how many episodes do we do a parameter update?
learning_rate = 3e-4
gamma = 0.99 # discount factor for reward
decay_rate = 0.99 # decay factor for RMSProp leaky sum of grad^2
resume = False # resume from prior checkpoint?
render = False

# model initialization
D = 80 * 80 # input dimensionality: 80x80 grid
if resume:
  model = pickle.load(open('save.p', 'rb'))
else:
  model = {}
  model['W1'] = np.random.randn(H,D) / np.sqrt(D) # "Xavier" initialization
  model['W2'] = np.random.randn(H) / np.sqrt(H)

grad_buffer = { k : np.zeros_like(v) for k,v in model.items() } # update buffers that add up gradients over a batch
rmsprop_cache = { k : np.zeros_like(v) for k,v in model.items() } # rmsprop memory

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x)) # sigmoid "squashing" function to interval [0,1]

def prepro(I):
  """ prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
  I = I[35:195] # crop
  I = I[::2, ::2, 0] # downsample by a factor of 2
  I[I == 144] = 0 # erase background (background type 1)
  I[I == 109] = 0 # erase background (background type 2)
  I[I != 0] = 1 # everything else (paddles, ball) just set to 1
  return I.astype(np.float64).ravel()

def discount_rewards(r):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(range(0, r.size)):
    if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (specific to Pong!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add
  return discounted_r

def policy_forward(x):
  h = np.dot(model['W1'], x)
  h[h<0] = 0 # ReLU nonlinearity
  logp = np.dot(model['W2'], h)
  p = sigmoid(logp)
  return p, h # return probability of taking action 2, as well as hidden state

def policy_backward(eph, epdlogp):
  """ backward pass. (eph is an array of intermediate hidden states) """
  dW2 = np.dot(eph.T, epdlogp).ravel()
  dh = np.outer(epdlogp, model['W2'])
  dh[eph <= 0] = 0 # backprop relu
  dW1 = np.dot(dh.T, epx) # epx is a global set in the main loop below
  return {'W1':dW1, 'W2':dW2}

env = gym.make("Pong-v0")
observation = env.reset()
prev_x = None # used in computing the difference frame
xs,hs,dlogps,drs = [],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 0

while True:
  if render: env.render()

  # preprocess the observation, set input to network to be difference image
  cur_x = prepro(observation)
  x = cur_x - prev_x if prev_x is not None else np.zeros(D)
  prev_x = cur_x

  # forward the policy network and sample an action from the returned probability
  aprob, h = policy_forward(x)
  action = 2 if np.random.uniform() < aprob else 3 # roll the dice!

  # record various intermediaries (needed later for backprop)
  xs.append(x) # observation
  hs.append(h) # hidden state
  y = 1 if action == 2 else 0 # a "fake label"
  dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken

  # step the environment and get new measurements
  observation, reward, done, info = env.step(action)
  reward_sum += reward
  drs.append(reward) # record reward (has to be done after we call step() to get the reward for the previous action)

  if done: # an episode finished
    episode_number += 1

    # stack together all inputs, hidden states, action gradients, and rewards for this episode
    epx = np.vstack(xs)
    eph = np.vstack(hs)
    epdlogp = np.vstack(dlogps)
    epr = np.vstack(drs)
    xs,hs,dlogps,drs = [],[],[],[] # reset array memory

    # compute the discounted reward backwards through time
    discounted_epr = discount_rewards(epr)
    # standardize the rewards to be unit normal (helps control the gradient estimator variance)
    discounted_epr -= np.mean(discounted_epr)
    discounted_epr /= np.std(discounted_epr)

    epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.)
    grad = policy_backward(eph, epdlogp)
    for k in model: grad_buffer[k] += grad[k] # accumulate grad over batch

    # perform rmsprop parameter update every batch_size episodes
    if episode_number % batch_size == 0:
      for k,v in model.items():
        g = grad_buffer[k] # gradient
        rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
        model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
        grad_buffer[k] = np.zeros_like(v) # reset batch gradient buffer

    # book-keeping work
    running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
    print('resetting env. episode reward total was %f. running mean: %f' % (reward_sum, running_reward))
    if episode_number % 100 == 0: pickle.dump(model, open('save.p', 'wb'))
    reward_sum = 0
    observation = env.reset() # reset environment
    prev_x = None

  if reward != 0: # Pong has either +1 or -1 reward exactly when the game ends.
    print('ep %d: game finished, reward: %f' % (episode_number, reward) + ('' if reward == -1 else ' !!!!!!'))

After running the code for a solid weekend, I was able to build an agent which can hold its own against the CPU, though it won’t dominate the game. Still, it’s nice to see an example of training a computer to perform a reasonably complex task (deflecting a ball into the opponent’s goal while preventing the same) when all you provide is a set of possible instructions on how to act (move the paddle up or down) and an indication of how you did in the prior round.

Space age graphics!

New DataCamp Course: Data Modeling in Power BI

I wanted to announce a brand new DataCamp course, entitled Data Modeling in Power BI. This course provides an introduction to techniques which you can use to simplify and speed up Power BI data models, with an emphasis on dimensional modeling and the Kimball technique of creating and working with star schemas.
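
If star schemas are new to you, here’s a tiny, hypothetical illustration in pandas of the shape involved: a narrow fact table holding keys and measures, with descriptive attributes living in dimension tables you join in only when you need them. (The course itself works in Power BI; this is just to show the structure.)

# A star schema in miniature: fact table with keys + measures,
# dimension table with descriptive attributes.
import pandas as pd

dim_customer = pd.DataFrame({
    "customer_key": [1, 2],
    "customer_name": ["Annie", "Bill"],
    "region": ["East", "West"],
})
fact_sales = pd.DataFrame({
    "customer_key": [1, 1, 2],
    "date_key": [20210501, 20210502, 20210502],
    "sales_amount": [100.0, 250.0, 75.0],
})

# Aggregate the narrow fact table, then join the dimension for labels.
by_customer = (fact_sales.groupby("customer_key", as_index=False)["sales_amount"].sum()
               .merge(dim_customer, on="customer_key"))
print(by_customer)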

This is a little bit different from my last DataCamp course in that I was a collaborator on this one, so all of my work was behind the scenes. The final course is a product of my vision and Sara & Maarten’s excellent implementation work. So go check out the course and share your thoughts.

Upcoming Events: Techorama Virtual Edition

Key Details

What: Techorama 2021.
Where: On the Internet (Belgium, UTC+2).
When: Monday, May 17th through Wednesday, May 19th.
Tickets are available for sale on the Techorama website.

What I’m Presenting

8:45 AM — 9:45 AM EDT — Of Types and Measures

It’s a little rare that I get to give an F#-focused talk, so I’m glad they selected this one. Also, Techorama is a conference I’d really like to attend in person one of these years.

SQL Day Poland

Key Details

What: SQL Day Poland.
Where: On the Internet (Poland).
When: Monday, May 10th through Wednesday, May 12th.
Admission is paid: 500 Polish zloty (roughly $130 USD) for the conference. RSVP on the SQL Day website.

What I’m Presenting

9:00 AM — 10:00 AM EDT — Does this look weird to you? An introduction to Anomaly Detection

This is a nearly new talk: I’ve given it a couple of times at user groups to warm it up, so I’m starting to get into the groove with it.

Upcoming Events: St Louis SQL Server and BI User Group

Key Details

What: St. Louis SQL Server and Business Intelligence User Group.
Where: On the Internet, UTC-5.
When: Tuesday, May 11th.
Admission is free. RSVP on Meetup.

What I’m Presenting

1:00 PM — 2:30 PM EDT — The Curated Data Platform

I enjoy giving this talk. It’s a whirlwind tour of data platform products, straddling the line between “Come check out all of these technologies!” and “Maybe you don’t need all of these technologies…”