Last month, at SQL Saturday Charlotte, I had the opportunity to meet Kiran Math, who gave a great talk introducing R to SQL developers. His talk was the basis for my blog post on dplyr and TidyR, and I want to walk through his housing market example in some detail here. A lot of this will follow the code he made available in his talk, but I also want to take this opportunity to move in a couple of directions he couldn’t due to time constraints.
Getting A Data Set
Basically, Kiran has a co-worker who wanted to move from one ZIP code (29605) to another (29615), and his co-worker wants to figure out how much he will likely get for his house when selling, and how much a similarly-sized house will cost in the new ZIP code. To figure this out, his co-worker pulled together data for all home sales in those two ZIP codes over a time period of several months and saved it. For this post, I’m going to use his CSV file, which has five fields: ID (auto-incrementing identity integer), SaleDate (date type), Area (number of square feet), ZIP_29615 (how much it sold for in ZIP code 29615, in thousands of dollars), and ZIP_29605 (how much it sold for in ZIP code 29605, in thousands of dollars). This isn’t a particularly good data structure, but developers will be developers…
Improving The Data Set
The first thing we need to do is import that data into R. Grab Kiran’s file and make sure you have dplyr, tidy, and ggplot2 installed:
install.packages("dplyr") install.packages("tidyr") install.packages("ggplot2") library("dplyr") library("tidyr") library("ggplot2")
Now that we have those packages installed, let’s load the data set from CSV. I’ll assume that you’re running this in Linux and the CSV is in your home directory’s Downloads folder:
path2csv <- file.path('~/Downloads/vHomeSales.csv') dat <- read.csv(path2csv, as.is = TRUE) head(dat) str(dat)
Right off the bat, we can see some interesting things. First of all, looking at the data set, we see a lot of NA records. This is because the initial data set is an unpivoted set. An individual sale takes place in one of the two ZIP codes, but we have separate columns for each ZIP code. This means that only one of the two ZIP code columns will ever have a value.
> head(dat) ID SaleDate Area ZIP_29615 ZIP_29605 1 1 6/3/2015 1300 NA 62 2 2 6/3/2015 1310 NA 67 3 3 6/3/2015 1320 NA 62 4 4 6/3/2015 1350 NA 57 5 5 6/3/2015 1370 NA 72 6 6 6/3/2015 1300 NA 62 > str(dat) 'data.frame': 432 obs. of 5 variables: $ ID : int 1 2 3 4 5 6 7 8 9 10 ... $ SaleDate : chr "6/3/2015" "6/3/2015" "6/3/2015" "6/3/2015" ... $ Area : int 1300 1310 1320 1350 1370 1300 1400 1400 1440 1450 ... $ ZIP_29615: int NA NA NA NA NA NA 158 NA 160 NA ... $ ZIP_29605: int 62 67 62 57 72 62 NA 82 NA 85 ...
Also, when running str, we see that SaleDate is a character type, based on how it was read in. So let’s fix that now:
dat$SaleDate <- as.Date(dat$SaleDate, format="%m/%d/%Y")
Remember that last time we tried this, I needed to do some fancy formatting to get dates to load. In Kiran’s code, he does not specify a format, but I noticed that the dates looked a bit off when I ran it, so I am specifying a format. Kiran’s code also has us converting the ZIP Code data to integers, but as you see above, when I ran the code in RStudio, the ZIP Code fields were already integers, so I’m skipping that.
From here, we get into using dplyr and TidyR. My post on using dyplr and TidyR should serve as a useful introduction to these two tools. The first thing we want to do is grab a subset of variables. In this case, we don’t care about the ID. We also want to get all sales since January 1st of 2015, but don’t need to know the sale date. dplyr has a filter function which works great for this scenario:
tDat <- dat %>% filter(SaleDate > '1/1/2015') %>% select(Area,ZIP_29615,ZIP_29605)
Now that we have the three variables we want, it’s time to correct the data set. We really just want two variables: Area and ZIPCode. Pivoting this data set will get rid of the NA values quite nicely and make it easier for us to chart, plot, and manipulate the data. We’re going to gather the two ZIP code columns together and leave the other variable alone:
gDat<- gather(tDat, Zipcode, SalePrice, 2:3, na.rm=TRUE) head(gDat)
This leaves us an interesting intermediate result set:
> head(gDat) Area Zipcode SalePrice 7 1400 ZIP_29615 158 9 1440 ZIP_29615 160 11 1460 ZIP_29615 160 13 1460 ZIP_29615 157 15 1560 ZIP_29615 168 17 1580 ZIP_29615 173 > str(gDat) 'data.frame': 432 obs. of 3 variables: $ Area : int 1400 1440 1460 1460 1560 1580 1600 1620 1700 1800 ... $ Zipcode : Factor w/ 2 levels "ZIP_29615","ZIP_29605": 1 1 1 1 1 1 1 1 1 1 ... $ SalePrice: int 158 160 160 157 168 173 168 173 188 198 ...
Basically, we moved the variable name (ZIP_29605 or ZIP_29615) into a variable called Zipcode and took the value of that variable and moved it into SalePrice. We left Area alone, so it’s the same as in tDat. But here’s the thing about the remaining, pivoted data set: the ZIP codes start with ZIP_, and we’d rather get rid of that and change it to “29615” or “29605.” Also, we want to make sure that Zipcode remains a factor: 29615 and 29605 are discrete labels, not continuous variables. The following bits of code do this:
mDat <- mutate(gDat,Zipcode=extract_numeric(Zipcode)) mDat$Zipcode <- as.factor(mDat$Zipcode)
Playing With The Data
At this point, we have our final data set, with ZIP Code, square footage, and sale price (in thousands of dollars).
> head(mDat) Area Zipcode SalePrice 1 1400 29615 158 2 1440 29615 160 3 1460 29615 160 4 1460 29615 157 5 1560 29615 168 6 1580 29615 173 > str(mDat) 'data.frame': 432 obs. of 3 variables: $ Area : int 1400 1440 1460 1460 1560 1580 1600 1620 1700 1800 ... $ Zipcode : Factor w/ 2 levels "29605","29615": 2 2 2 2 2 2 2 2 2 2 ... $ SalePrice: int 158 160 160 157 168 173 168 173 188 198 ...
The first thing you typically want to do with a data set is plot it. Our minds are pretty good at making sense of visualizations, and we can use a basic plot to see if there are any trends we can discern.
We can definitely see a trend here. It looks like the houses in ZIP Code 29615 are more expensive than similarly-sized houses in 29605, except for one red outlier, which might be miscoded data or could just be somebody who got a great deal selling his house.
Modeling The Data
So now that we see a bit of a trend, the next question is to ask, how much money could we expect to get if we sold a house in ZIP Code 29605? To do that, we’re going to build a linear model using Ordinary Least Squares. R has a great built-in function for that, namely lm.
dat29605 <- filter(mDat, Zipcode == 29605) model_sale <- lm(SalePrice~Area,data=dat29605) predict(model_sale,data.frame(Area=3000),interval="predict" )
The above code does three things. First, it filters out everything whose ZIP Code is not 29605, leaving the relevant data set. Next, we build a linear model in which we predict sale price based on the house’s square footage, using dat29605 as our data set. Finally, we make a prediction based off of this model, asking how much we might expect to get for a house whose size is 3000 square feet. The result:
> predict(model_sale,data.frame(Area=3000),interval="predict" ) fit lwr upr 1 198.0433 146.3831 249.7035
The mean point is approximately $198,000, and a 95% confidence interval gives us a range of $146,000 through $249,000. That’s not an extremely precise estimate, but it gives us a starting point for negotiations.
Similarly, let’s say that we want to buy a house in ZIP Code 29615 which is the same size. Here’s what we come up with:
dat29615 <- filter(mDat, Zipcode == 29615) model_sale_29615 <- lm(SalePrice~Area,data=dat29615) predict(model_sale_29615,data.frame(Area=3000),interval="predict" )
And now the prediction:
> predict(model_sale_29615,data.frame(Area=3000),interval="predict" ) fit lwr upr 1 315.4288 267.0714 363.7861
So we can sell a house in ZIP Code 29605 for about $198K (give or take $50K) and purchase an equally-sized house in ZIP Code 29615 for about $315K (give or take $40K).
If we want more information on the linear model results, we can use the built-in summary() function. Running summary against each model gives us the following results:
> summary(model_sale) Call: lm(formula = SalePrice ~ Area, data = dat29605) Residuals: Min 1Q Median 3Q Max -56.180 -18.361 1.147 19.668 108.437 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -34.356979 4.938838 -6.956 4.14e-11 *** Area 0.077467 0.001573 49.251 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 26.15 on 215 degrees of freedom Multiple R-squared: 0.9186, Adjusted R-squared: 0.9182 F-statistic: 2426 on 1 and 215 DF, p-value: < 2.2e-16
> summary(model_sale_29615) Call: lm(formula = SalePrice ~ Area, data = dat29615) Residuals: Min 1Q Median 3Q Max -67.45 -19.12 1.39 19.05 71.78 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.197209 5.240978 1.564 0.119 Area 0.102411 0.001561 65.608 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 24.47 on 213 degrees of freedom Multiple R-squared: 0.9528, Adjusted R-squared: 0.9526 F-statistic: 4304 on 1 and 213 DF, p-value: < 2.2e-16
Both of these models show that there is a strong relationship between Area and SalePrice. Intuitively, we understand this well: bigger houses will, all other things being equal, sell for more money. In this case, differences in size accounted for 91% of the variance in sale price in ZIP Code 29605 and 95% in ZIP Code 29615. This says that, within a ZIP Code, size is the primary consideration. So how about a model which combines the two ZIP Codes together?
model_sale_all <- lm(SalePrice~Area,data=mDat) summary(model_sale_all)
When we look at the linear model results, we get:
> summary(model_sale_all) Call: lm(formula = SalePrice ~ Area, data = mDat) Residuals: Min 1Q Median 3Q Max -143.134 -62.336 2.531 62.334 138.647 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -28.296646 9.303846 -3.041 0.0025 ** Area 0.095344 0.002863 33.306 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 65.85 on 430 degrees of freedom Multiple R-squared: 0.7207, Adjusted R-squared: 0.72 F-statistic: 1109 on 1 and 430 DF, p-value: < 2.2e-16
When looking at both data sets together, square footage is a dominant factor, explaining 72% of the variance in price. But notice that we went from explaining 90-95% of the variance down to 72%. This tells us that there are other factors at play here. We can speculate on some of these factors—maybe better schools, lower tax rates, superior amenities, or maybe ZIP Code 29605 is full of jerks and nobody wants to live near them.
Thanks again to Kiran Math for the work he put into this introduction to R. You can grab all of his code, as well as Power Point slides, at the link above.