R Training In Washington DC

Are you interested in learning R but don’t know where to begin?  Do you have corporate training funds burning a hole in your pocket that you desperately need to spend before the year runs out?  Or alternatively, do you want some good training before an action-packed SQL Saturday but don’t have a big budget?

I am giving a full-day R training on Friday, December 7th in Washington DC.  Tickets are still available for this training event.  Here’s the abstract:

DESCRIPTION

Enter The Tidyverse: R for the Data Professional

In this day-long training, you will learn about R, the premier language for data analysis. We will approach the language from the standpoint of data professionals: database developers, database administrators, and data scientists. We will see how data professionals can translate existing skills with SQL to get started with R. We will also dive into the tidyverse, an opinionated set of libraries which has modernized R development. We will see how to use libraries such as dplyr, tidyr, and purrr to write powerful, set-based code. In addition, we will use ggplot2 to create production-quality data visualizations.
Over the course of the day, we will look at several problem domains. For database administrators, areas of note will include visualizing SQL Server data, predicting error occurrences, and estimating backup times for new databases. We will also look at areas of general interest, including analysis of open source data sets.
No experience with R is necessary. The only requirements are a laptop and an interest in leveling up your data professional skillset.
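To give a quick taste of what that set-based style looks like (this is an illustrative sketch rather than an excerpt from the course notebooks), here is a dplyr pipeline that mirrors a SQL GROUP BY query and a ggplot2 chart of the result, using the nycflights13 sample data set:

```r
# Illustrative sketch: the pipeline below mirrors
# SELECT carrier, AVG(dep_delay) FROM flights WHERE dep_delay IS NOT NULL GROUP BY carrier
library(dplyr)
library(ggplot2)
library(nycflights13)   # sample data set; any data frame works the same way

delays <- flights %>%
  filter(!is.na(dep_delay)) %>%            # WHERE dep_delay IS NOT NULL
  group_by(carrier) %>%                    # GROUP BY carrier
  summarize(avg_delay = mean(dep_delay))   # AVG(dep_delay)

# ggplot2 turns the summarized data frame into a production-quality chart.
ggplot(delays, aes(x = carrier, y = avg_delay)) +
  geom_col() +
  labs(x = "Carrier", y = "Average departure delay (minutes)")
```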

I’ve done this training a few times now and have settled on Azure Notebooks, which makes it easier for people to take this stuff home and play around (for free!) on their own.  If you attend, you get a jam-packed day of training as well as two dozen notebooks full of material to get you from “new to R” to solving practical problems in your environment.

Sign up today!

PASS Summit 2018 Evaluation Ratings & Comments

Following up on Brent Ozar’s post on the topic, I figured I’d post my own ratings (mostly because they’re not awful!).  This was my first PASS Summit at which I was a speaker, so I don’t have a good comp for scores except what other speakers publish.  I had the privilege of giving three presentations at PASS Summit this year, and I’m grateful for everyone who decided to sit in on these rather than some other talk.  All numeric responses are on a 1-5 scale with 5 being the best.

Applying Forensic Accounting Techniques Using SQL and R

This was a talk that I’ve given a few times and even have an extra-long director’s cut version available.  I had 71 attendees and 14 responses.

Eval Question | Avg Rating
Rate the value of the session content. | 4.21
How useful and relevant is the session content to your job/career? | 3.43
How well did the session’s track, audience, title, abstract, and level align with what was presented? | 4.21
Rate the speaker’s knowledge of the subject matter. | 4.86
Rate the overall presentation and delivery of the session content. | 4.29
Rate the balance of educational content versus that of sales, marketing, and promotional subject matter. | 4.64

The gist of the talk is, here are techniques that forensic accountants use to find fraud; you can use them to learn more about your data.  I fell flat in making that connection, as the low “useful” score shows.  That’s particularly bad because I think this is probably the most “immediately useful” talk that I did.
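To make that connection concrete, one of the classic forensic accounting techniques is comparing leading-digit frequencies against Benford’s Law.  Here’s a minimal sketch of the idea (using a made-up invoices data frame, not data from the talk):

```r
# Hypothetical example: compare leading digits of invoice amounts to the
# distribution Benford's Law predicts. The invoices data frame is made up.
set.seed(1)
invoices <- data.frame(amount = round(rlnorm(5000, meanlog = 6, sdlog = 1.2), 2))

leading_digit <- as.integer(substr(format(invoices$amount, scientific = FALSE,
                                          trim = TRUE), 1, 1))
observed <- as.numeric(prop.table(table(factor(leading_digit, levels = 1:9))))
benford  <- log10(1 + 1 / (1:9))   # expected frequency of each leading digit

comparison <- round(rbind(observed, benford), 3)
colnames(comparison) <- 1:9
comparison
# Large gaps between the two rows are a hint to dig deeper, not proof of fraud.
```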

Event Logistics Comments

  • Room was very cold
  • very cold rooms, very aggressive air-con
  • The stage was squeaky and made banging noises when the speaker was trying to present. Not their fault! The stage just didn’t seem very stable. Also the room had a really unpleasant smell.
  • Everything was great!

The squeak was something I noticed before the talk.  I thought about staying in place to avoid the squeak, but this is a talk where I want to gesticulate to emphasize points—like moving from one side of the stage to the other to represent steps in a process.  My hope was that the squeak wouldn’t be too noticeable but the microphone may have picked it up.

Speaker Comments

  • I was just curious about the topic, but the speaker inspired me with many smaller, but very feasible tips and tricks of how to look at a data! Thank You!
  • The Jupyter notebooks were awesome. I felt the speaker really knew their stuff.  But the downsides were that the analysis methods discussed weren’t really shown to us, or were so far out of context I didn’t quite see how to use them or how they related to the demos. Multiple data sets were used and maybe just focusing all the methods on one of them may have worked better? I just felt overall it was a really interesting topic with a lot of work done but it just didn’t come together for me. Sorry.
  • I like understanding how fraud got uncovered by looking at data. Thanks
  • Very interesting session. Speak made the subject very interesting. I’ve picked up a few ideas I can use in my job.
  • Some examples of discovery of fraud would have been more effective.
  • One of my favorite sessions at PASS. Thank you for making the jupyter notebook available for download.
  • Great speaker/content, would attend again.

The second comment is exactly the kind of comment I want.  My ego loves the rest of the comments, but #2 makes me want to tear this talk apart and rebuild it better.  The biggest problem that I have with the talk is that my case study involved actual fraud, but none of the data sets I have really show fraud.  I’m thinking of rebuilding this talk using just one data set where I seed in fraudulent activities and expose it with various techniques.  Ideally, I’d get a copy of the case study’s data, but I never found it anywhere.  Maybe I could do a FOIA request or figure out some local government contact.

Getting Started with Apache Spark

My second session was Friday morning.  I had 100 somewhat-awake attendees but only 5 responses, so take the ratings with a grain of salt.

Eval Question | Avg Rating
Rate the value of the session content. | 4.80
How useful and relevant is the session content to your job/career? | 4.60
How well did the session’s track, audience, title, abstract, and level align with what was presented? | 4.80
Rate the speaker’s knowledge of the subject matter. | 5.00
Rate the overall presentation and delivery of the session content. | 4.80
Rate the balance of educational content versus that of sales, marketing, and promotional subject matter. | 5.00

This is a talk that I created specifically for PASS Summit.  I’m happy that it turned out well, considering that there was a good chance of complete demo failure:  my Portable Hadoop Cluster was finicky that morning and wanted to connect to the Internet to grab updates before it would let me run anything.  Then I had to restart the Apache Zeppelin service mid-talk to run any notebooks, but once that restarted successfully, the PHC ran like a champ.

Event Logistics Comments

  • Good

Speaker Comments

  • Session was 100, but I would say 200-300
  • Great presentation!

Getting session levels right is always tricky.  In this case, I chose 100 rather than 200 because I spent the first 30+ minutes going through the history of Hadoop & Spark and a fair amount of the remaining time looking at Spark SQL.  But I did have a stretch where I got into RDD functions, and most T-SQL developers will be unfamiliar with map, reduce, aggregate, and other functions.  So that’s a fair point—calling it a 200 level talk doesn’t bother me.
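For anyone who hasn’t seen that functional style before, base R’s own Map, Filter, and Reduce give a reasonable feel for it.  These are local analogues I’m using purely for illustration, not Spark API calls:

```r
# Local, base-R analogues of the RDD-style operations (not Spark API calls):
# Map transforms every element, Filter keeps matching elements, and Reduce
# folds the collection down to one value, much like an aggregate.
waits_ms <- c(120, 45, 300, 15, 980, 60)

squared   <- Map(function(x) x * x, waits_ms)      # map: transform each element
long_ones <- Filter(function(x) x > 100, waits_ms) # filter: keep waits over 100ms
total     <- Reduce(`+`, waits_ms)                 # reduce: collapse to a sum

total            # 1520
unlist(squared)  # 14400 2025 90000 225 960400 3600
long_ones        # 120 300 980
```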

Cleaning is Half the Battle:  Launching a Data Science Project

This was my last PASS Summit talk, which I presented in the last session slot on Friday.  I had 31 attendees and 7 responses.

Eval Question | Avg Rating
Rate the value of the session content. | 4.71
How useful and relevant is the session content to your job/career? | 4.71
How well did the session’s track, audience, title, abstract, and level align with what was presented? | 4.86
Rate the speaker’s knowledge of the subject matter. | 4.86
Rate the overall presentation and delivery of the session content. | 4.57
Rate the balance of educational content versus that of sales, marketing, and promotional subject matter. | 4.71

Again, small sample size bias applies.

Event Logistics Comments

  • Good
  • Great

Speaker Comments

  • You have a lot content, leading to a rushed talk. Also, your jokes have potential, if u slowed down and sold them better
  • Good presentation. I expected more on getting the project off the ground, but enjoyed the info.
  • Funny and informative–I truly enjoyed your presentation!
  • Great
  • Kevin knows how to present, especially for getting stuck w/ the last session of the last day. He brought a lot of energy to the room. Content was on key too, he helped me understand more about handling data that we don’t want to model.

The slowing down comment is on point.  This is a 90-minute talk by its nature.  I did drop some content (like skipping slides on data cleansing and analysis and just showing the demos) so that I could spend a little more time on the neural network portion of the show, but I had to push to keep on time and technically went over by a minute or two.  I was okay with the overage because it was the final session, so I wasn’t going to block anybody.

Synthesis and Commentary

The ratings numbers are something to take with several grains of salt:  26 ratings over 3 sessions isn’t nearly a large enough sample to know for sure how these turned out.  But here are my thoughts from the speaker’s podium.

  • I speak fast.  I know it and embrace it—I know of the (generally good) advice that you want to go so slow that it feels painful, but I’ll never be that person.  In the Ben Stein — Billy Mays continuum, I’d rather err on the OxiClean side.
  • I need to cut down on material.  The first and last talks could both be better with less.  The problem with cutting material in the data science process talk is that I’d like to cover the whole process with a realistic-enough example, and that takes time.  So this is something I’ll have to work on.
  • I might need to think of a different title for my data science process talk.  I explicitly call out that it’s about launching a data science project, but as I was sitting in a different session, I overheard a couple of people mention the talk and one person said something along the lines of not being interested because he’s already seen data cleansing talks.  The title is a bit jokey and has a punchline in the middle of the session, so I like it, but maybe something as simple as swapping the order of the segments to “Launching a Data Science Project:  Cleaning is Half the Battle” would be enough.
  • Using a timer was a really good idea.  I normally don’t use timers and instead go by feel at SQL Saturdays and user group meetings, and that leads to me sometimes running short on time.  I tend to practice talks with a timer to get an idea of how long they should last, but rarely re-time myself later, so they tend to shift in length as I do them.  Having the timer right in front of me helped keep me on track.
  • For the Spark talk, I think when I create my normal slide deck, I’m going to include the RDD (or “Spark 1.0”) examples as inline code segments and walk through them more carefully.  For an example of what I mean, I have a section in my Classification with Naive Bayes talk where I walk through classification of text based on word usage.  Normally, I’d make mention of the topic and go to a notebook where I walk through the code.  But that might have been a little jarring for people brand new to Spark.
  • I tend to have a paucity of images in talks, making up for it by drawing a lot on the screen.  I personally like the effect because action and animation keep people interested and it’s a lot easier for me to do that by drawing than by creating the animations myself…  It does come with the downside of making the slides a bit drier and making it harder for people to review the slides later, as they lose some of that useful information.  As I’ve moved presentations to GitPitch I’ve focused on adding interesting but not too obtrusive backgrounds in the hopes that this helps.  Still, some of the stuff that I regularly draw should probably show up as images.

So it’s not perfect, but I didn’t have people hounding me with pitchforks and torches after any of the sessions.  I have some specific areas of focus and intend to take a closer look at most of my talks to improve them.

GitPitch: Revamping My Slide Decks

Over the past few months, I’ve started transitioning my slide decks to GitPitch.  I’ve used reveal.js for years and enjoyed being able to put together presentations in HTML format.  I can run these slides in any browser (or even on phones) and don’t have to worry about PowerPoint messing up when I go to presentation mode.  I also get the ability to transition up and down as well as left and right, which lets me “drill into” topics.  For example, from my Genetic Algorithms talk:

If I want to go into a topic, I can slide down; if I need to skip it for time, I can bypass the section.

GitPitch also uses Reveal, so many of the skills that I’ve built up still work for me.  But there are a few compelling reasons for me to migrate.

Reasons For Moving To GitPitch

So why did I move to GitPitch?  It started with an offer of a free year of GitPitch Pro to Microsoft MVPs.  For transparency purposes, I am taking advantage of that free year.  After the year is up, I’m going to start paying for the service, but I did get it for free.

So, the free year aside, why did I make the jump?  There are a few reasons.

Background Images

One of the best features of GitPitch (and a reason I went with Pro over the Free edition) is the ability to use a background image and set the transparency level.  This lets me add flavor to my slide decks without overwhelming the audience:

There are a number of good sources for getting free images you can use for these purposes, including Unsplash, Creative Commons Search, and SkyPixel for drone photos.  I’m also working to introduce my own photography when it makes sense.

Markdown

Reveal.js uses HTML to build slides, but GitPitch uses Markdown.  Markdown is a fairly easy syntax to pick up and is a critical part of Jupyter Notebooks.

To build the slide above, the markdown looks like this:

---?image=presentation/assets/background/Background_13.jpg&size=cover&opacity=15

### Outlook

|Outlook|Yes|No|P(Yes)|P(No)|
|-------|---|--|------|-----|
|Sunny|2|3|2/9|3/5|
|Overcast|4|0|4/9|0/5|
|Rainy|3|2|3/9|2/5|
|Total|9|5|100%|100%|

No HTML, and it’s pretty easy to make changes.  Because it’s a text-based format, major changes are also pretty easy to find-and-replace, something hard to do with PowerPoint.

Math

Recently, I did a presentation on Naive Bayes classifiers.  To display math, we can use MathJax.  Here’s an example slide:

---

### Applying Bayes' Theorem

Supposing multiple inputs, we can combine them together like so:

`$P(B|A) = \dfrac{P(x_1|B) * P(x_2|B) * ... * P(x_n|B)}{P(A)}$`

This is because we assume that the inputs are **independent** from one another.

Given `$B_1, B_2, ..., B_N$` as possible classes, we want to find the `$B_i$` with the highest probability.

And here’s how it looks:

I’m definitely not very familiar with MathJax, but there are online editors to help you out.

Code Samples

Another cool feature is the ability to include code samples:

What’s really cool is the ability to walk through step by step:

---

```r
if(!require(naivebayes)) {
  install.packages("naivebayes")
  library(naivebayes)
}
if(!require(caret)) {
  install.packages("caret")
  library(caret)
}

data(iris)

set.seed(1773)
irisr <- iris[sample(nrow(iris)),]
irisr <- irisr[sample(nrow(irisr)),]

iris.train <- irisr[1:120,]
iris.test <- irisr[121:150,]

nb <- naivebayes::naive_bayes(Species ~ ., data = iris.train)
#plot(nb, ask=TRUE)

iris.output <- cbind(iris.test,
                     prediction = predict(nb, iris.test))
table(iris.output$Species, iris.output$prediction)
confusionMatrix(iris.output$prediction, iris.output$Species)
```

@[1-4](Install the naivebayes package to generate a Naive Bayes model.)
@[5-8](Install the caret package to generate a confusion matrix.)
@[12-14](Pseudo-randomize the data set. This is small so we can do it by hand.)
@[16-17](Generate training and test data sets.)
@[19](Building a Naive Bayes model is as simple as a single function call.)
@[22-25](Generate predictions and analyze the resulting predictions for accuracy.)

I haven’t used this as much as I want to, as my talks historically have not included much code—I save that for the demos.  But with the ability to walk through code a section at a time, it makes it easier to present code directly.

Up On GitHub

Something I wasn’t a huge fan of at first but grew to be okay with is that the markdown files and images are stored in your GitHub repo.  For example, my talk on the data science process is just integrated into my GitHub repo for the talk.  This has upsides and downsides.  The biggest upside is that I don’t have to store the slides anywhere else, but I’ll describe the downside in the next section.

Tricky Portions

There are a few things that I’ve had to work out in the process.

Handling Updates

Testing things out can be a bit annoying because you have to push changes to GitHub first.  Granted, this isn’t a big deal—commit and push, test, commit and push, test, rinse and repeat.  But previously I could save the HTML file, open it locally, and test the results.  That was a smoother process, though as I’ve built up a couple of decks, I have patterns I can follow so that reduces the pain quite a bit.

Online Only (Mostly)

There is an offline slideshow mode in GitPitch Pro and a desktop version, but I haven’t really gotten into those.  It’s easier for me just to work online and push to my repos.

When presenting, I do need to be online to grab the presentation, but that’s something I can queue up in my hotel room the night before if I want—just keep the browser window open and I can move back and forth through the deck.

Down Transitions And Background Images

One thing I’ve had to deal with when migrating slide decks is that although GitPitch supports the same up-down-left-right mechanics as Reveal.js, background images with transparency tend to bleed together when moving downward.  I’ve dealt with the problem by going back to a “classic” left-right only slide transition scheme.

Wrapping Up

I’ve become enough of a fan of GitPitch to decide that I’m going to use that as my default.  I think the decks are more visually compelling:  for example, there’s my original data science process slide deck and my GitPitch version.  As far as content goes, both decks tell the same story, but I think the GitPitch version retains interest a bit better.  As I give these talks in the new year, I’ll continue transitioning to GitPitch, and new talks will go there by default.

Enter The Tidyverse: Full-Day R Training In Charlotte

I am giving a full-day R training on Friday, October 19th in Charlotte, North Carolina.  Tickets are still available for this training event.  Here’s the abstract:

DESCRIPTION

SQL Saturday Charlotte and the Charlotte BI Group are pleased to offer this full day workshop with Kevin Feasel.

Enter The Tidyverse: R for the Data Professional

In this day-long training, you will learn about R, the premier language for data analysis. We will approach the language from the standpoint of data professionals: database developers, database administrators, and data scientists. We will see how data professionals can translate existing skills with SQL to get started with R. We will also dive into the tidyverse, an opinionated set of libraries which has modernized R development. We will see how to use libraries such as dplyr, tidyr, and purrr to write powerful, set-based code. In addition, we will use ggplot2 to create production-quality data visualizations.
Over the course of the day, we will look at several problem domains. For database administrators, areas of note will include visualizing SQL Server data, predicting error occurrences, and estimating backup times for new databases. We will also look at areas of general interest, including analysis of open source data sets.
No experience with R is necessary. The only requirements are a laptop and an interest in leveling up your data professional skillset.

With Azure Notebooks, teaching and learning R becomes a lot easier—rather than fighting computers trying to install all of the pre-requisite software, we can jump straight to the notebooks.

Upcoming Speaking Engagements

It’s been a busy few months, so I’m going to keep radio silent for a little while longer (though Curated SQL is going strong).  In the meantime, here’s where I’m going to be over the next month:

  1. On Saturday, May 19th, I’ll be in New York City for SQL Saturday NYC, where I’m going to do two presentations:  Using Kafka for Real-Time Data Ingestion with .NET and Much Ado About Hadoop.
  2. On Tuesday, May 22nd, I’ll be in Charlotte presenting for the Enterprise Developers Guild, giving a talk entitled A .NET Developer’s View of the OWASP Top 10.
  3. On Thursday, May 31st, I’m going to give a paid pre-con entitled Enter the Tidyverse for SQL Saturday Mexico.  Click the link for instructions on signing up.
  4. Then, on Saturday, June 2nd, I’ll be in Mexico City for SQL Saturday Mexico, where I will present two talks:  Data Cleansing with SQL and R and Working Effectively with Legacy SQL.
  5. On Thursday, June 7th, I will present my R for the SQL Server Developer talk to the Roanoke Valley .NET User Group.
  6. Then, on June 9th, I’m going to give 3 sessions at SQL Saturday South Florida.  Those talks are Eyes on the Prize, Much Ado About Hadoop, and APPLY Yourself.
  7. I’ll be in Houston, Texas on June 23rd for SQL Saturday Houston.  There’s no official confirmation on talk(s) just yet, but I can confirm that I’ll be there and will do at least one session.

I have a few more irons in the fire as well, but this wraps up my May and June.

Enter The Tidyverse, Columbus Edition

In conjunction with SQL Saturday Columbus, I am giving a full-day training session entitled Enter the Tidyverse:  R for the Data Professional on Friday, July 27th.  This is a training that I did earlier in the year in Madison, Wisconsin, and aside from having no voice at the end, I think it went really well.  I’ve tweaked a couple of things to make this training even better; it’s well worth the low, low price of $100 for a full day of training on the R programming language.

I use the term “data professional” on purpose:  part of what I do with this session is show attendees how, even if they are database administrators, it can pay to know a bit about the R programming language.  Database developers, application developers, and budding data scientists will also pick up a good bit of useful information during this training, so it’s fun for the whole data platform.

Throughout the day, we will use a number of data sources which should be familiar to database administrators:  wait stats, database backup times, Reporting Services execution log metrics, CPU utilization statistics, and plenty more.  These are the types of things which database administrators need to deal with on a daily basis, and I’ll show you how you can use R to make your life easier.
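As a rough flavor of that (this is a sketch, not one of the actual training notebooks), pulling wait stats into R and plotting them looks something like the following; the driver and server names are placeholders you’d swap for your own environment:

```r
# Rough sketch, not a course notebook: pull top waits from SQL Server and plot them.
# Driver/server/database names below are placeholders.
library(DBI)
library(odbc)
library(ggplot2)

conn <- dbConnect(odbc(),
                  Driver   = "ODBC Driver 17 for SQL Server",
                  Server   = "YourServer",
                  Database = "master",
                  Trusted_Connection = "Yes")

waits <- dbGetQuery(conn, "
    SELECT TOP(10) wait_type, wait_time_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;")

dbDisconnect(conn)

ggplot(waits, aes(x = reorder(wait_type, wait_time_ms), y = wait_time_ms / 1000)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Wait time (seconds)", title = "Top 10 wait types")
```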

If you sign up for the training in Columbus, the cost is only $100 and you’ll walk away with a better knowledge of how you can level up your database skills with the help of a language specially designed for analysis.  Below is the full abstract for my training session.  If this sounds interesting to you, sign up today!  I’m not saying you should go out and buy a couple dozen tickets today, but you should probably buy one dozen today and maybe a dozen more tomorrow; pace yourself, that’s all I’m saying.

Course Description

In this day-long training, you will learn about R, the premier language for data analysis.  We will approach the language from the standpoint of data professionals:  database developers, database administrators, and data scientists.  We will see how data professionals can translate existing skills with SQL to get started with R.  We will also dive into the tidyverse, an opinionated set of libraries which has modernized R development.  We will see how to use libraries such as dplyr, tidyr, and purrr to write powerful, set-based code.  In addition, we will use ggplot2 to create production-quality data visualizations.

Over the course of the day, we will look at several problem domains.  For database administrators, areas of note will include visualizing SQL Server data, predicting error occurrences, and estimating backup times for new databases.  We will also look at areas of general interest, including analysis of open source data sets.

No experience with R is necessary.  The only requirements are a laptop and an interest in leveling up your data professional skillset.

Intended Audience

  • Database developers looking to tame unruly data
  • Database administrators with an interest in visualizing SQL Server metrics
  • Data analysts and budding data scientists looking for an overview of the R landscape
  • Business intelligence professionals needing a powerful language to cleanse and analyze data efficiently

Contents

Module 0 — Prep Work

  • Review data sources we will cover during the training
  • Ensure laptops are ready to go

Module 1 — Basics of R

  • What is R?
  • Basic mechanics of R
  • Embracing functional programming in R
  • Connecting to SQL Server with R
  • Identifying missing values, outliers, and obvious errors

Module 2 — Intro To The Tidyverse

  • What is the Tidyverse?
  • Tidyverse principles
  • Tidyverse basics:  dplyr, tidyr, readr, tibble

Module 3 — Dive Into The Tidyverse

  • Data loading:  rvest, httr, readxl, jsonlite, xml2
  • Data wrangling:  stringr, lubridate, forcats, broom
  • Functional programming:  purrr

Module 4 — Plotting

  • Data visualization principles
  • Chartjunk
  • Types of plots:  good, bad, and ugly
  • Plotting data with ggplot2
    • Exploratory plotting
    • Building professional quality plots

Module 5 — Putting it Together:  Analyzing and Predicting Backup Performance

  • A capstone notebook which covers many of the topics we covered today, focusing on Database Administration use cases
  • Use cases include:
    • Gathering CPU statistics
    • Analyzing Disk Utilization
    • Analyzing Wait Stats
    • Investigating Expensive Reports
    • Analyzing Temp Table Creation Stats
    • Analyzing Backup Times

Course Objectives

Upon completion of this course, attendees will be able to:

  • Perform basic data analysis with the R programming language
  • Take advantage of R functions and libraries to clean up dirty data
  • Build a notebook using Jupyter Notebooks
  • Create data visualizations with ggplot2

Pre-Requisites

No experience with R is necessary, though it would be helpful.  Please bring a laptop to follow along with exercises and get the most out of this course.

What Comes After Go-Live?

This is part eight of a series on launching a data science project.

At this point in the data science process, we’ve launched a product into production.  Now it’s time to kick back and hibernate for two months, right?  Yeah, about that…

Just because you’ve got your project in production doesn’t mean you’re done.  First of all, it’s important to keep checking the efficacy of your models.  Shift happens, where a model might have been good at one point in time but becomes progressively worse as circumstances change.  Some models are fairly stable, where they can last for years without significant modification; others have unstable underlying trends, to the point that you might need to retrain such a model continuously.  You might also find out that your training and testing data was not truly indicative of real-world data, especially because the real world is a lot messier than what you trained against.

The best way to guard against unnoticed model shift is to take new production data and retrain the model.  This works best if you can keep track of your model’s predictions versus actual outcomes; that way, you can tell the actual efficacy of the model, figuring out how frequently and by how much your model was wrong.
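That bookkeeping can be as simple as logging each prediction next to the eventual actual and recomputing the error metric over time.  Here’s a minimal sketch with made-up numbers:

```r
# Minimal sketch with made-up numbers: keep predictions alongside actuals
# and watch the error metric month over month.
library(dplyr)

scored <- data.frame(
  month     = rep(c("2018-09", "2018-10", "2018-11"), each = 3),
  predicted = c(95, 110, 72,  98, 115, 70,  101, 120, 69),
  actual    = c(93, 118, 75,  90, 130, 80,   80, 145, 95)
)

scored %>%
  group_by(month) %>%
  summarize(mae  = mean(abs(predicted - actual)),
            bias = mean(predicted - actual))
# A steadily climbing MAE (or a bias drifting away from zero) is the signal
# to retrain on newer data.
```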

Depending upon your choice of algorithm, you might be able to update the existing model with this new information in real time.  Models like neural networks and online passive-aggressive algorithms allow for continuous training, and when you’ve created a process which automatically feeds learned data back into your continuously-training model, you now have true machine learning. Other algorithms, however, require you to retrain from scratch.  That’s not a show-stopper by any means, particularly if your underlying trends are fairly stable.
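For the retrain-from-scratch case, a scheduled job that refits on a rolling window of recent data is often all it takes.  A toy sketch, using a stand-in history data frame and a simple linear regression:

```r
# Toy sketch of batch retraining: refit from scratch on the most recent
# window of observations. `history` is a stand-in for your production log.
set.seed(42)
history <- data.frame(db_size_gb = runif(500, 10, 500))
history$backup_mins <- 2 + 0.15 * history$db_size_gb + rnorm(500, sd = 3)

retrain <- function(history, window_size = 200) {
  recent <- tail(history, window_size)        # keep only recent behavior
  lm(backup_mins ~ db_size_gb, data = recent) # refit from scratch
}

model <- retrain(history)
predict(model, data.frame(db_size_gb = 250))
```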

Regardless of model selection, efficacy, and whether you get to call what you’ve done machine learning, you will want to confer with your stakeholders and ensure that your model actually fits their needs; as I mentioned before, you can have the world’s best regression, but if the people with the sacks of cash want a recommendation engine, you’re not getting the goods.  But that doesn’t mean you should try to solve all the problems all at once; instead, you want to start with a Minimum Viable Product (MVP) and gauge interest.  You’ve developed a model which solves the single most pressing need, and from there, you can make incremental improvements.  This could include relaxing some of the assumptions you made during initial model development, making more accurate predictions, improving the speed of your service, adding new functionality, or even using this as an intermediate engine to derive some other result.

Using our data platform survey results, assuming the key business personnel were fine with the core idea, some of the specific things we could do to improve our product would be:

  • Make the model more accurate.  Our MAE was about $19-20K, and reducing that error makes our model more useful for others.  One way to do this would be to survey more people.  What we have is a nice starting point, but there are too many gaps to go much deeper than a national level.
  • Introduce intra-regional cost of living.  We all know that $100K in Manhattan, NY and $100K in Manhattan, KS are quite different.  We would want to take into account cost of living, assuming we have enough data points to do this.
  • Use this as part of a product helping employers find the market rate for a new data professional, where we’d ask questions about the job location, relative skill levels, etc. and gin up a reasonable market estimate.

There are plenty of other things we could do over time to add value to our model, but I think that’s a nice stopping point.

What’s Old Is New Again

Once we get to this phase, the iterative nature of this process becomes clear.

The Team Data Science Process Lifecycle (Source)

On the micro level, we bounce around within and between steps in the process.  On the macro level, we iterate through this process over and over again as we develop and refine our models.  There’s a definite end game (mostly when the sacks of cash empty), but how long that takes and how many times you cycle through the process will depend upon how accurate and how useful your models are.

In wrapping up this series, if you want to learn more, check out my Links and Further Information on the topic.