Home price prediction is a cornerstone of Zillow’s real estate database business. Accurate and timely price predictions help buyers and sellers establish mutually beneficial trades, and allow government agencies to calculate tax liability and design related policies. Although Zillow has built impressive prediction models for a number of US cities, its predictions for San Francisco are not as accurate as they could be because they do not incorporate enough local intelligence. We are therefore pleased to present a new machine-learning model with greater predictive and generalizable power, which will help secure Zillow’s leading position in this highly profitable and influential field.
The essence of our model is to borrow the experience of recently transacted properties to predict the values of homes that have not yet been sold. The main challenge in building this model, as in any machine-learning model, is to derive the most predictive variables (or ‘features’) from the dataset of already-sold homes (the ‘training set’) while avoiding overfitting, so that the model generalizes to the dataset of unsold homes (the ‘test set’). We also paid special attention to the algorithm’s generalizability across different urban contexts: a lack of such generalizability can lead to systematic over- or under-prediction in certain neighborhoods, which would not only impair analytic accuracy in the real estate market but also cause serious public policy problems. After much effort, we believe we have managed the trade-off between accuracy and generalizability (the ‘bias/variance trade-off’) well, and created a model that is both predictive and generalizable across the many different places in and around San Francisco.
The following report is organized into four main parts: the source and processing of data, the modeling method and validation, the modeling results and their generalizability, and final remarks.
Our primary dataset is a list of 9,433 sold homes (with sale prices) and 700 held-out homes (without sale prices) randomly selected in San Francisco from 2012 to 2015 by Zillow. We also collected data from the U.S. Census website and San Francisco’s Open Data portal, taking care not to include any dataset that might contain sales information or be derived from it (e.g., property tax data).
Following the hedonic model, we used three types of data to predict San Francisco’s home prices: internal characteristics, amenities, and spatial structure.
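Concretely, the hedonic framework treats sale price as a linear function of these three feature groups. In the notation below (our own shorthand, not Zillow’s), $C$, $A$, and $S$ denote the internal-characteristic, amenity, and spatial-structure variables of home $i$:

```latex
\text{Price}_i = \beta_0 + \sum_{j}\beta_j C_{ij} + \sum_{k}\gamma_k A_{ik} + \sum_{l}\delta_l S_{il} + \varepsilon_i
```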
From Zillow’s dataset we selected the number of bedrooms, bathrooms, and stories, property area, property age, and sale date as the main internal characteristics. We recoded some of the original variables and corrected implausible values such as 0 stories and 52 bedrooms.
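As a minimal sketch of this cleaning step (with hypothetical thresholds and toy data, not the actual pipeline), implausible values can be replaced with the median of the valid observations:

```r
# Replace values outside a plausible range [lo, hi] with the median
# of the in-range values. Thresholds here are illustrative only.
clean_feature <- function(x, lo, hi) {
  bad <- x < lo | x > hi
  x[bad] <- median(x[!bad])
  x
}

stories <- c(1, 2, 0, 3, 1)   # toy data; 0 stories is implausible
beds    <- c(2, 3, 52, 1, 4)  # 52 bedrooms is implausible

clean_feature(stories, 1, 10)  # 0 replaced by 1.5 (median of valid values)
clean_feature(beds, 1, 10)     # 52 replaced by 2.5
```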
For amenities, we used the distance between each house and basic public services such as schools and health-care facilities, and used counts within a buffer to measure the density and diversity of businesses, public art, trees, parks, public transit, and road length (i.e., accessibility for private vehicles) surrounding the property. To determine which business types best predict house prices, we first regressed house price on each type of business and then selected the most significant predictors: massage, storage, arts and entertainment, construction, private education, and health services. Since different amenities serve different spatial scales, we adopted different buffer sizes for different services.
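The buffer-count idea can be sketched on toy projected coordinates; the real pipeline uses sf buffers and spatial joins, but plain Euclidean distance shows the logic:

```r
# For each home, count amenity points within `radius` metres
# (toy coordinates in metres; all values are illustrative).
count_within <- function(px, py, ax, ay, radius) {
  sapply(seq_along(px), function(i)
    sum(sqrt((ax - px[i])^2 + (ay - py[i])^2) <= radius))
}

# two homes, three amenity points (e.g. transit stops), 600 m buffer
homes_x <- c(0, 1000); homes_y <- c(0, 0)
amen_x  <- c(100, 500, 1200); amen_y <- c(0, 0, 0)
count_within(homes_x, homes_y, amen_x, amen_y, 600)  # → 2 2
```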
For the spatial aspect, we computed a slope map of San Francisco from elevation data, since slope affects the walkability and cyclability of neighborhoods and thus house prices. We used the distance to the nearest assault crime as a safety-related variable. We also added several census variables to capture San Francisco’s spatial characteristics, including the percentage of people in poverty, the percentage of non-Hispanic residents, and median household income. Additionally, we included the average sale price of the 5 nearest homes (“lagPrice”), since a Moran’s I test indicated that the value of a house is significantly related to the values of surrounding properties. More specific spatial structure variables, such as neighborhood and census tract, were also considered.
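The spatial lag can be sketched in base R as the mean price of the k nearest sold homes (toy data; the report’s pipeline computes the same quantity with spatial-weights tooling such as spdep):

```r
# Mean sale price of the k nearest homes, excluding the home itself.
lag_price <- function(x, y, price, k = 5) {
  sapply(seq_along(x), function(i) {
    d <- sqrt((x - x[i])^2 + (y - y[i])^2)
    d[i] <- Inf                    # exclude the home itself
    mean(price[order(d)[1:k]])     # mean price of the k nearest homes
  })
}

set.seed(1)
x <- runif(10); y <- runif(10)     # toy coordinates
price <- runif(10, 5e5, 2e6)       # toy sale prices
lp <- lag_price(x, y, price, k = 3)
lp
```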
Following standard machine-learning practice, we compiled and wrangled the raw data into a consistent dataset for all 10,133 home sites. After exploratory analysis and feature engineering, we settled on 27 significant and relatively uncorrelated predictors spanning internal characteristics, amenities, and spatial structure:
summary_table <- houseprice.sf3 %>%
  dplyr::select(PropArea, Baths, Beds, Stories, year, SaleYr,
                tree.Buffer100, transit600.Buffer, roadlength_200, parksdensity3,
                parksdensity2, art.Buffer800, school_1, massage.Buffer1000,
                storage.Buffer600, Assault600.Buffer, Assault_1, slope, ArtEnt600,
                Construction600, Private600, perc_Non_Hisp, Perc_Poverty,
                Median_Income, lagPrice)
stargazer(as.data.frame(summary_table), type="text", title = "Summary Statistics")
##
## Summary Statistics
## =======================================================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -------------------------------------------------------------------------------------------------------
## PropArea 10,128 1,644.283 711.427 187 1,160 1,970 7,679
## Baths 10,128 1.789 0.969 0 1 2 8
## Beds 10,128 2.134 1.280 1 1 3 10
## Stories 10,128 1.391 0.611 1 1 2 10
## SaleYr 10,128 13.449 1.106 12 12 14 15
## tree.Buffer100 10,128 68.541 45.945 0 32 101 338
## transit600.Buffer 10,128 36.759 13.625 4 27 44 111
## roadlength_200 10,128 2,423.779 585.640 509.401 2,007.884 2,735.410 6,811.748
## parksdensity3 10,128 32,437.878 32,957.989 0.000 7,502.254 50,069.713 242,067.393
## parksdensity2 10,128 637.100 1,212.479 0.000 0.000 873.419 7,974.879
## art.Buffer800 10,128 11.378 15.028 0 2 16 124
## school_1 10,128 277.109 145.789 7.407 165.826 367.150 1,015.135
## massage.Buffer1000 10,128 3.842 4.385 0 1 5 59
## storage.Buffer600 10,128 1.414 2.514 0 0 2 29
## Assault600.Buffer 10,128 364.708 402.054 2 128 441 6,624
## Assault_1 10,128 59.500 36.958 4.200 33.771 74.583 380.836
## slope 10,128 5.178 4.153 0 2 8 25
## ArtEnt600 10,128 40.628 31.854 0 19 56 219
## Construction600 10,128 51.834 21.019 0 36 65 129
## Private600 10,128 40.291 54.131 0 11 46 474
## perc_Non_Hisp 10,128 0.850 0.107 0.377 0.786 0.927 0.983
## Perc_Poverty 10,128 0.151 0.046 0.010 0.129 0.178 0.436
## Median_Income 10,128 44,215.342 14,429.998 12,320 34,770 45,964 100,974
## lagPrice 10,128 1,068,575.261 541,001.486 104,001.400 675,176.850 1,315,159.550 4,283,001.600
## -------------------------------------------------------------------------------------------------------
As the following correlation matrix demonstrates, the 23 numeric variables exhibit relatively little collinearity, and each shows a positive or negative relationship with the final sale price.
houseprice.sf3$slope <- as.numeric(as.character(houseprice.sf3$slope))
houseprice.df3 <- as.data.frame(houseprice.sf3)
houseprice.df3<-
houseprice.df3%>%
filter(SalePrice<=4000000&SalePrice >=300000) %>%
dplyr::select(SalePrice, tree.Buffer100, transit600.Buffer, roadlength_200, parksdensity3, parksdensity2, art.Buffer800, school_1, massage.Buffer1000, storage.Buffer600, Assault600.Buffer, Assault_1, PropArea, Baths, Beds, Stories, slope, ArtEnt600, Construction600, Private600, perc_Non_Hisp, Perc_Poverty, Median_Income, lagPrice)
#unlist(lapply(houseprice.df3, class))
M <- cor(houseprice.df3)
corrplot(M, method = "number")  # correlation matrix plot
To better illustrate the correlations between home price and selected variables, we plotted the following scatterplots. Note that despite buyers’ preference for cultural, environmental, and safety amenities, internal characteristics remain the most influential category of home-price predictors.
# ArtEnt600 = number of "Arts, Entertainment, and Recreation" businesses within a 600m buffer
# tree.Buffer100 = number of trees within a 100m buffer
# Assault_1 = distance to the nearest assault crime
# PropArea = property area
holdout <- subset(houseprice.sf3, houseprice.sf3$SalePrice > 0)  # sold homes only
as.data.frame(holdout)%>%
dplyr::select(SalePrice, tree.Buffer100, Assault_1, PropArea, ArtEnt600) %>%
gather(Variable, Value, -SalePrice) %>%
ggplot(aes(Value, SalePrice)) +
geom_point() +
geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable,ncol=2,scales="free") +
labs(title = "Home price and 4 related variables") +
theme(axis.text.x = element_blank()) +
plotTheme()
Similarly, we examined these correlations spatially by mapping the variables on a base map of San Francisco. High-value houses clearly cluster in the central neighborhoods and along the north-eastern coast, in accordance with the clustering of arts and entertainment facilities. Conversely, the distributions of construction businesses and assault crimes show the opposite trend, which may help explain why people are reluctant to live in certain areas.
# map observed sale prices (sold homes only)
ggplot() +
geom_sf(data = sf_basemap, fill = "grey40") +
geom_sf(data = holdout, aes(colour = q5(SalePrice)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(holdout,"SalePrice"),
name="Quintile\nBreaks") +
labs(title="Observed sale prices") +
mapTheme()
#holdout
ggplot() +
geom_sf(data = sf_basemap, fill = "grey40") +
geom_sf(data = holdout, aes(colour = q5(ArtEnt600)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(holdout,"ArtEnt600"),
name="Quintile\nBreaks") +
labs(title="Number of arts & entertainment businesses within 600m buffer") +
mapTheme()
ggplot() +
geom_sf(data = sf_basemap, fill = "grey40") +
geom_sf(data = holdout, aes(colour = q5(Construction600)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(holdout,"Construction600"),
name="Quintile\nBreaks") +
labs(title="Number of construction businesses within 600m buffer") +
mapTheme()
ggplot() +
geom_sf(data = sf_basemap, fill = "grey40") +
geom_sf(data = holdout, aes(colour = q5(Assault_1)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(holdout,"Assault_1"),
name="Quintile\nBreaks") +
labs(title="Distance to nearest Assault crime") +
mapTheme()
The goal of our geospatial prediction is to borrow the experience from existing house transactions and test the extent to which that experience generalizes to predicting the values of unsold houses. As discussed above, our model is an abstraction of each home site’s reality using carefully constructed features. For estimation we used OLS regression because it is more transparent and computationally more efficient than more advanced machine-learning algorithms. We then validated the accuracy and generalizability of the model by dividing Zillow’s transaction dataset into a training set and a test set, thus simulating the process of predicting unknown home prices from feature patterns learned on known home prices. By examining the absolute value, percentage, and spatial distribution of errors in the test set, we can tell whether the model generalizes well to new data and to different spatial contexts.
Since a “good” model predicts accurately for both existing and new data, we randomly divided the sold homes into a training set (60%) and a test set (40%). Below are the main model results for the training set and the prediction results for the test set.
The adjusted R-squared of the training set was 0.708, while the test set’s adjusted R-squared reached 0.722. The chosen predictors therefore explain more than 70% of the variation in home prices, indicating good explanatory power. In addition, the F-statistic had a very small p-value, so we can state with confidence that the model fits the data well and offers meaningful explanations of house-price variation.
As for accuracy and generalizability, the Mean Absolute Error (MAE) on the test set is 248,009, meaning our predictions are on average about 250,000 dollars above or below the real price. Although this is not a trivial error, it must be weighed against the capital-intensive nature of the real estate business. The Mean Absolute Percent Error (MAPE) is only 24.68%, meaning predicted prices deviate from observed prices by less than a quarter on average.
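For reference, these two error metrics are defined as follows, where $y_i$ is the observed sale price and $\hat{y}_i$ the predicted price over the $n$ test-set homes:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{\left|y_i - \hat{y}_i\right|}{y_i}
```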
#R^2 table, training set
holdout <- subset(houseprice.sf3, houseprice.sf3$SalePrice > 0)  # sold homes only
set.seed(12321)
inTrain <- createDataPartition(
y = paste(holdout$GEOID, holdout$SaleYr,holdout$nhood),
p = .60, list = FALSE)
sanf.training <- holdout[inTrain,]
sanf.test <-holdout[-inTrain,]
reg_train <- lm(SalePrice ~ tree.Buffer100 + transit600.Buffer + roadlength_200 +
                  parksdensity3 + parksdensity2 + art.Buffer800 + school_1 +
                  massage.Buffer1000 + storage.Buffer600 + Assault600.Buffer +
                  Assault_1 + PropArea + Baths + Beds + Stories + slope +
                  ArtEnt600 + Construction600 + Private600 + perc_Non_Hisp +
                  Perc_Poverty + year + Median_Income + lagPrice + SaleYr +
                  nhood + GEOID,
                data = sanf.training)
sum <- summary(reg_train)$coef[1:32,]%>% round(5)
sum
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5027850.18006 27525554.46732 -0.18266 0.85507
## tree.Buffer100 1452.76027 188.55280 7.70479 0.00000
## transit600.Buffer -400.01380 821.15046 -0.48714 0.62618
## roadlength_200 -58.48123 11.64654 -5.02134 0.00000
## parksdensity3 0.12894 0.23863 0.54032 0.58900
## parksdensity2 -19.93993 6.66470 -2.99187 0.00278
## art.Buffer800 -1235.43272 697.06021 -1.77235 0.07639
## school_1 57.56515 44.33101 1.29853 0.19416
## massage.Buffer1000 716.79037 3946.23974 0.18164 0.85587
## storage.Buffer600 -2343.44855 4224.75120 -0.55470 0.57912
## Assault600.Buffer -123.89267 42.73125 -2.89935 0.00375
## Assault_1 437.54657 167.20285 2.61686 0.00890
## PropArea 384.68110 11.37982 33.80378 0.00000
## Baths -1491.43499 8289.00818 -0.17993 0.85721
## Beds -2094.65995 5280.62877 -0.39667 0.69163
## Stories 79362.22392 10937.80018 7.25578 0.00000
## slope 6063.37417 1906.37092 3.18058 0.00148
## ArtEnt600 2429.29656 674.20599 3.60320 0.00032
## Construction600 -706.05338 526.24356 -1.34169 0.17975
## Private600 -796.69676 320.04966 -2.48929 0.01283
## perc_Non_Hisp -631499.09518 4583461.78237 -0.13778 0.89042
## Perc_Poverty -1096805.76108 9520683.57215 -0.11520 0.90829
## year1890-1910 -37725.18854 47313.56383 -0.79734 0.42528
## year1910-1930 -41457.80715 47293.34868 -0.87661 0.38074
## year1930-1950 -4971.07496 47754.85694 -0.10410 0.91710
## year1950-1970 -51002.66328 49045.51185 -1.03990 0.29843
## year1970-1990 -258074.03827 52571.16364 -4.90904 0.00000
## year1990-2010 -146312.64756 53059.18932 -2.75754 0.00584
## Median_Income 77.40336 749.22122 0.10331 0.91772
## lagPrice 0.22620 0.01696 13.33707 0.00000
## SaleYr 164603.26456 4526.54111 36.36403 0.00000
## nhoodBayview Hunters Point -45233.46070 1101419.37312 -0.04107 0.96724
#Residual standard error: 378386 on 5713 degrees of freedom
#Multiple R-squared: 0.7188498, Adjusted R-squared: 0.7081708
#F-statistic: 67.31383 on 217 and 5713 DF, p-value: < 0.00000000000000022204
sanf.test <-
sanf.test %>%
mutate(Regression = "Baseline Regression",
SalePrice.Predict = predict(reg_train, sanf.test),
SalePrice.Error = SalePrice - SalePrice.Predict,
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice)%>%
filter(SalePrice < 6000000)
sanf.test.df <- dplyr::select(as.data.frame(sanf.test), -geometry)
reg_test <- lm(SalePrice ~ tree.Buffer100 + transit600.Buffer + roadlength_200 +
                 parksdensity3 + parksdensity2 + art.Buffer800 + school_1 +
                 massage.Buffer1000 + storage.Buffer600 + Assault600.Buffer +
                 Assault_1 + PropArea + Baths + Beds + Stories + slope +
                 ArtEnt600 + Construction600 + Private600 + perc_Non_Hisp +
                 Perc_Poverty + year + Median_Income + lagPrice + SaleYr +
                 nhood + GEOID,
               data = sanf.test)
MAE <- sanf.test.df%>%
summarize(mean(SalePrice.AbsError, na.rm = T)) %>%
round(3)%>%
pull()
MAPE <- sanf.test.df%>%
summarize(mean(SalePrice.APE, na.rm = T) * 100) %>%
round(2) %>%
paste("%")
x <- data.frame("MAE"=MAE,
"MAPE"=MAPE,
"Adjusted R^2"=0.722)
x%>%
kable() %>%
kable_styling("striped", full_width = F)
| MAE | MAPE | Adjusted.R.2 |
|---|---|---|
| 248008.699 | 24.68 % | 0.722 |
To further examine whether our model overfits the training data, we ran 100-fold cross-validation on the training set and plotted the cross-validation MAE as a histogram. A standard deviation of 34,834 suggests relatively little variation across folds, given a mean MAE of 255,572 (very similar to the test-set MAE). The tightly clustered distribution of errors also indicates comparable goodness-of-fit metrics across folds.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(1314)
reg4.cv <- train(SalePrice ~ ., data = as.data.frame(sanf.training) %>%
dplyr::select(SalePrice, tree.Buffer100, transit600.Buffer, roadlength_200, parksdensity3, parksdensity2, art.Buffer800, school_1, massage.Buffer1000, storage.Buffer600, Assault600.Buffer, Assault_1, PropArea, Baths, Beds, Stories, slope, ArtEnt600, Construction600, Private600, perc_Non_Hisp, Perc_Poverty, year, Median_Income, lagPrice, SaleYr, nhood ),
method = "lm", trControl = fitControl, na.action = na.pass)
mean <- mean(reg4.cv$resample[,3])%>%round(2)
std <- sd(reg4.cv$resample[,3])%>%round(2)
y <- data.frame(
"Std"=std,
"Mean"=mean,
"R^2"=0.70)
y%>%
kable() %>%
kable_styling("striped", full_width = F)
| Std | Mean | R.2 |
|---|---|---|
| 34834.48 | 255571.98 | 0.7 |
#histogram
ggplot(as.data.frame(reg4.cv$resample), aes(MAE)) +
geom_histogram(bins = 50, colour="white", fill = "#FA7800") +
labs(title="Distribution of MAE", subtitle = "k-fold cross validation; k = 100",
x="Mean Absolute Error", y="Count") +
plotTheme()
We then further examined prediction accuracy and generalizability by plotting predicted prices as a function of observed prices. The result indicates that, despite generally acceptable predictive power, our model does tend to underpredict the most expensive homes. This may be caused by insufficient sampling of such houses in our database.
ggplot(sanf.test, aes( SalePrice, SalePrice.Predict)) +
geom_point(colour = "#FA7800") +
geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
geom_smooth(data=sanf.test, aes(SalePrice,SalePrice), method = "lm", se = FALSE, colour = "red")+
labs(title = "Predicted price as a function of observed price",
caption = "Red line represents a perfect prediction; Green line represents our prediction",
x = "Observed Sale Price",
y = "Predicted Sale Price") +
plotTheme()
Next, we examined our model’s generalizability over space. A map of the test set’s residuals (errors) shows a spatially random pattern, suggesting that the spatial variation underlying home prices has been largely captured by our model. A Moran’s I test likewise showed that the model’s errors exhibit little spatial autocorrelation.
ggplot() +
geom_sf(data = sf_basemap, fill = "grey40") +
geom_sf(data = sanf.test, aes(colour = q5(SalePrice.Error)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(sanf.test,"SalePrice.Error"),
name="Quintile\nBreaks") +
labs(title="Residuals of the test set") +
mapTheme()
#moran's I
coords.test <-
filter(sanf.test, !is.na(SalePrice.Error)) %>%
st_coordinates()
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
moranTest <- moran.mc(filter(sanf.test, !is.na(SalePrice.Error))$SalePrice.Error,
spatialWeights.test , nsim = 999)
ggplot(data.frame(moranI = moranTest$res[1:999]), aes(moranI)) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in red",
x="Moran's I",
y="Count",
caption="Public Policy Analytics, Figure 6.8") +
plotTheme()
Having confirmed the accuracy and generalizability of our model, we mapped the final predicted values for the entire dataset (both sold and held-out homes). This predicted-value map shows spatial patterns similar to the observed-value map in the Data section, indicating the model’s good predictive power.
reg_holdout <- lm(SalePrice ~ tree.Buffer100 + transit600.Buffer + roadlength_200 +
                    parksdensity3 + parksdensity2 + art.Buffer800 + school_1 +
                    massage.Buffer1000 + storage.Buffer600 + Assault600.Buffer +
                    Assault_1 + PropArea + Baths + Beds + Stories + slope +
                    ArtEnt600 + Construction600 + Private600 + perc_Non_Hisp +
                    Perc_Poverty + year + Median_Income + lagPrice + SaleYr +
                    nhood + GEOID,
                  data = holdout)
predict_all <-
houseprice.sf3 %>%
mutate(
SalePrice.Predict = predict(reg_holdout, houseprice.sf3),
SalePrice.Error = SalePrice - SalePrice.Predict,
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice)%>%
filter(SalePrice < 6000000)
ggplot() +
geom_sf(data = sf_basemap, fill = "grey40") +
geom_sf(data = predict_all, aes(colour = q5(SalePrice.Predict)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(predict_all,"SalePrice.Predict"),
name="Quintile\nBreaks") +
labs(title="Predicted sale prices, all homes") +
mapTheme()
Last but not least, we checked the model’s generalizability across urban contexts. Using the test-set predictions, we mapped mean absolute percentage error (MAPE) by neighborhood and by census tract, and plotted MAPE by neighborhood (or tract) as a function of mean price by neighborhood (or tract). The tract-effects model’s accuracy is clearly more consistent across space than the neighborhood-effects model’s, which suggests the former is more generalizable. This is why we included tract ID, rather than neighborhood name, as the spatial fixed-effect indicator in our final model.
sanf.test.nhoods <-
sanf.test %>%
group_by(nhood) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T),
mean.SalePrice = mean(SalePrice, na.rm = T)) %>%
st_set_geometry(NULL)%>%
left_join(neighourhood)%>%
st_sf()
#by neighbourhood
ggplot() +
geom_sf(data = sanf.test.nhoods, aes(fill = q5(mean.MAPE))) +
geom_sf(data = sanf.test, colour="black", size = .3) +
scale_fill_manual(values = palette5,
labels=qBr(sanf.test.nhoods,"mean.MAPE",rnd=F),
name="Quintile\nBreaks") +
labs(title="Mean test set MAPE by neighbourhood") +
mapTheme()
#by tract
tract <-
get_acs(geography = "tract", variables = c("B01003_001"),
year = 2015, state=06, county=075, geometry=T) %>%
st_transform(102286) %>%
dplyr::select(GEOID,geometry)
sanf.test.tract <-
sanf.test %>%
group_by(GEOID) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T),
mean.SalePrice = mean(SalePrice, na.rm = T)) %>%
st_set_geometry(NULL)%>%
left_join(tract)%>%
st_sf()
ggplot() +
geom_sf(data = sanf.test.tract, aes(fill = q5(mean.MAPE))) +
geom_sf(data = sanf.test, colour="black", size = .3) +
scale_fill_manual(values = palette5,
labels=qBr(sanf.test.tract,"mean.MAPE",rnd=F),
name="Quintile\nBreaks") +
labs(title="Mean test set MAPE by Tract") +
mapTheme()
# by tract
ggplot(sanf.test.tract, aes( mean.SalePrice, mean.MAPE)) +
geom_point(colour = "#FA7800") +
geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
labs(title = "MAPE by tract as a function of Mean Price by tract",
x = "mean.SalePrice",
y = "mean.MAPE") +
plotTheme()
#by neighbourhood
ggplot(sanf.test.nhoods, aes( mean.SalePrice, mean.MAPE)) +
geom_point(colour = "#FA7800") +
geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
labs(title = "MAPE by neighbourhood as a function of mean price by neighbourhood",
x = "mean.SalePrice",
y = "mean.MAPE") +
plotTheme()
We also used census data to split San Francisco into two groups (by race and by income) to test our model’s generalizability. As the following tables indicate, our model generalizes well across income contexts, but does not perform as well across racial contexts. However, since the gap between groups is relatively small (less than 5 percentage points), the model still offers a good reference for commercial trades and policy making.
#by racial
sanf.test %>%
group_by(raceContext) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
st_set_geometry(NULL) %>%
spread(raceContext, mean.MAPE) %>%
kable(caption = "Mean Absolute Percent Error of test set by racial context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "#25CB10")
| Majority Non-White | Majority White |
|---|---|
| 0.2275922545 | 0.2700526666 |
#by income
sanf.test %>%
group_by(incomeContext) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
st_set_geometry(NULL) %>%
spread(incomeContext, mean.MAPE) %>%
kable(caption = "Mean Absolute Percent Error of test set by income context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "#25CB10")
| High Income | Low Income |
|---|---|
| 0.2461552914 | 0.246963671 |
Given the above examinations and validations, our model is effective and generalizable in predicting home values in San Francisco. Overall, it explains and predicts more than 70% of the variation in house sale prices. The maps comparing predicted and observed prices also indicate that the model offers convincing explanations and predictions of the spatial variation in home values. Judging by coefficients and t-values, the most important predictors are the number of trees (within a 100m buffer), arts and entertainment facilities (within a 600m buffer), stories, year built, sale year, slope, and tract ID (the tract fixed effect), each representing a different aspect of internal characteristics, amenities, or spatial structure. Despite all efforts to balance accuracy and generalizability, our predictions still deviate from real prices by 248,009 dollars on average, which is less than 25% of the final sale price. Spatially, the model generalizes well across space and urban contexts, especially for the majority of sales at moderate prices. However, its predictive power is less satisfying for the cheapest and most expensive homes. This shortcoming is most likely caused by a lack of representative samples in the training data, and could be remedied by expanding the sample size and optimizing the sample structure.
To sum up, we strongly recommend that Zillow adopt our methodology to improve the accuracy and generalizability of house-price prediction in San Francisco, thus facilitating related business and administrative processes such as taxation. Although the model can be further improved by introducing more detailed, high-quality variables on internal characteristics and by adopting more powerful predictive algorithms, it has proven itself a sufficiently predictive and generalizable home-price prediction tool and provides a good starting point for future enhancement and generalization.