1 Introduction

Home price prediction is the cornerstone of Zillow's real estate database business. Accurate and timely home price prediction is fundamental for buyers and sellers to establish mutually beneficial trades, as well as for government agencies to calculate tax liability and design related policies. Although Zillow has constructed impressive prediction models for a number of US cities, its housing market predictions in San Francisco are not as accurate as they could be because they do not incorporate enough local intelligence. We are therefore pleased to present a new machine-learning model with greater predictive power and generalizability, which will help secure Zillow's leading position in this highly profitable and influential field.

The essence of our machine-learning model is to borrow the experience of recently transacted properties to predict the values of homes that have not yet been sold. The main challenge in building this model, as with any machine-learning model, is to extract the most predictive variables (or 'features') from the dataset of recently sold homes (the 'training set') to generate sufficient predictive power, while avoiding over-fitting to the training set so that the model generalizes to a new dataset of unsold homes (the 'test set'). We also paid special attention to the algorithm's generalizability across different urban contexts. A lack of such generalizability can lead a model to systematically over- or under-predict in certain neighborhoods, which would not only impair analytic accuracy in the real estate market but also create serious public policy problems. After much effort, we believe we have managed the trade-off between accuracy and generalizability (the 'bias/variance trade-off') well, and have created a model that is both predictive and generalizable to the many different places in and around San Francisco.

The following report is organized into four main parts: the sources and processing of data, the modeling method and validation, the modeling results and their generalizability, and final remarks.

2 Data

Our primary dataset is a list of 9433 sold homes (with sale prices) and 700 hold-out homes (without sale prices) randomly selected by Zillow in San Francisco from 2012 to 2015. We also collected data from the U.S. Census website and San Francisco's Open Data portal, taking care not to include any dataset that might contain sales information or be derived from it (e.g., the property tax dataset).

Following the hedonic model, we used three types of data to predict San Francisco's home prices: internal characteristics, amenities, and spatial structure.
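In its general form, the hedonic model treats each home's sale price as a function of these three feature groups plus an unexplained error term (the notation below is ours, for exposition):

$$\mathrm{Price}_i = f(\mathrm{Internal}_i,\ \mathrm{Amenities}_i,\ \mathrm{Spatial}_i) + \varepsilon_i$$

where $i$ indexes home sites and $\varepsilon_i$ captures the price variation that the features do not explain.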

We selected the number of bedrooms, bathrooms, and stories, property area, age of the property, and sale date from Zillow's dataset as the main internal characteristics. We recoded some of the original fields with simple statistical techniques and adjusted implausible values such as 0 stories and 52 bedrooms.
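As an illustration, here is a minimal cleaning sketch of the kind we applied; the cutoffs and replacement rules below are assumptions for demonstration, not our exact pipeline:

# Hypothetical cleaning sketch: recode implausible internal characteristics
houseprice.sf3 <- houseprice.sf3 %>%
  mutate(Stories = ifelse(Stories == 0, 1, Stories),  # a home has at least one story
         Beds = ifelse(Beds > 10,                     # recode outliers such as 52 bedrooms
                       median(Beds, na.rm = TRUE), Beds))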

For amenities, we used the distance between a house and services to measure its closeness to basic public services such as schools and health care facilities, and used the quantity within a buffer to measure the density and diversity of businesses, public art, trees, parks, public transit stops, and road length (i.e., accessibility for private vehicles) surrounding the property. To find the business types that best predict house prices, we first regressed house price on each type of business, then selected the most significant predictors: massage, storage, arts and entertainment, construction, private education, and health services. Since different types of amenities are intended to serve different spatial scales, we adopted buffers of different sizes for different services.
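For example, a "quantity within buffer" feature can be computed with sf; the helper below is a hypothetical sketch (assuming point geometries in a metric projection), not our exact code:

# Hypothetical sketch: count amenity points within an r-metre buffer of each home
buffer_count <- function(homes, amenities, r) {
  buffers <- st_buffer(homes, dist = r)           # r-metre buffer around each home
  lengths(st_intersects(buffers, amenities))      # amenity points falling in each buffer
}
# e.g. the tree.Buffer100 predictor, given a hypothetical "trees" point layer:
# houseprice.sf3$tree.Buffer100 <- buffer_count(houseprice.sf3, trees, 100)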

For the spatial aspect, we calculated a slope map of San Francisco from elevation data, since slope affects the walkability and cyclability of neighborhoods and thus house prices. We also used the distance to the nearest assault crime as a safety-related variable. In addition, we added several census variables to capture San Francisco's spatial characteristics, including "percentage of people in poverty", "percentage of non-Hispanic residents", and "median household income". We further included the average sale price of the 5 nearest homes ("lagPrice"), since a Moran's I test indicated that the value of a given house is significantly related to the values of surrounding properties. More specific spatial structure variables, such as neighborhood and census tract, are also considered in our model.
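The spatial lag itself can be built with the same spdep tools we use later for the Moran's I test; a minimal sketch, assuming a hypothetical sold.homes layer restricted to transacted properties so that zero-price hold-outs do not enter the average:

# Sketch: "lagPrice" = average sale price of the 5 nearest sold homes
coords.sold <- st_coordinates(sold.homes)              # point coordinates of sold homes
neighborList <- knn2nb(knearneigh(coords.sold, 5))     # 5 nearest neighbours per home
spatialWeights <- nb2listw(neighborList, style = "W")  # row-standardised weights
sold.homes$lagPrice <- lag.listw(spatialWeights, sold.homes$SalePrice)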

Following standard machine-learning methodology, we compiled and wrangled the raw data into a consistent dataset covering all 10133 home sites. After exploratory analysis and feature engineering, we settled on the 27 most significant and mutually independent predictors of internal characteristics, amenities, and spatial structure:

summary_table <- houseprice.sf3 %>%
  dplyr::select(PropArea, Baths, Beds, Stories, year, SaleYr,
                tree.Buffer100, transit600.Buffer, roadlength_200, parksdensity3,
                parksdensity2, art.Buffer800, school_1, massage.Buffer1000,
                storage.Buffer600, Assault600.Buffer, Assault_1, slope, ArtEnt600,
                Construction600, Private600, perc_Non_Hisp, Perc_Poverty,
                Median_Income, lagPrice)

stargazer(as.data.frame(summary_table), type = "text", title = "Summary Statistics")
## 
## Summary Statistics
## =======================================================================================================
## Statistic            N        Mean       St. Dev.       Min      Pctl(25)     Pctl(75)         Max     
## -------------------------------------------------------------------------------------------------------
## PropArea           10,128   1,644.283     711.427       187        1,160        1,970         7,679    
## Baths              10,128     1.789        0.969         0           1            2             8      
## Beds               10,128     2.134        1.280         1           1            3            10      
## Stories            10,128     1.391        0.611         1           1            2            10      
## SaleYr             10,128    13.449        1.106        12          12           14            15      
## tree.Buffer100     10,128    68.541       45.945         0          32           101           338     
## transit600.Buffer  10,128    36.759       13.625         4          27           44            111     
## roadlength_200     10,128   2,423.779     585.640     509.401    2,007.884    2,735.410     6,811.748  
## parksdensity3      10,128  32,437.878   32,957.989     0.000     7,502.254   50,069.713    242,067.393 
## parksdensity2      10,128    637.100     1,212.479     0.000       0.000       873.419      7,974.879  
## art.Buffer800      10,128    11.378       15.028         0           2           16            124     
## school_1           10,128    277.109      145.789      7.407      165.826      367.150      1,015.135  
## massage.Buffer1000 10,128     3.842        4.385         0           1            5            59      
## storage.Buffer600  10,128     1.414        2.514         0           0            2            29      
## Assault600.Buffer  10,128    364.708      402.054        2          128          441          6,624    
## Assault_1          10,128    59.500       36.958       4.200      33.771       74.583        380.836   
## slope              10,128     5.178        4.153         0           2            8            25      
## ArtEnt600          10,128    40.628       31.854         0          19           56            219     
## Construction600    10,128    51.834       21.019         0          36           65            129     
## Private600         10,128    40.291       54.131         0          11           46            474     
## perc_Non_Hisp      10,128     0.850        0.107       0.377       0.786        0.927         0.983    
## Perc_Poverty       10,128     0.151        0.046       0.010       0.129        0.178         0.436    
## Median_Income      10,128  44,215.342   14,429.998    12,320      34,770       45,964        100,974   
## lagPrice           10,128 1,068,575.261 541,001.486 104,001.400 675,176.850 1,315,159.550 4,283,001.600
## -------------------------------------------------------------------------------------------------------

As the following correlation matrix demonstrates, the 23 numeric variables exhibit relatively minor collinearity, and each shows a positive or negative relationship with the final home sale price.

houseprice.sf3$slope <- as.numeric(as.character(houseprice.sf3$slope))

# trim extreme sale prices before computing correlations
houseprice.df3 <- as.data.frame(houseprice.sf3) %>%
  filter(SalePrice <= 4000000 & SalePrice >= 300000) %>%
  dplyr::select(SalePrice, tree.Buffer100, transit600.Buffer, roadlength_200,
                parksdensity3, parksdensity2, art.Buffer800, school_1,
                massage.Buffer1000, storage.Buffer600, Assault600.Buffer,
                Assault_1, PropArea, Baths, Beds, Stories, slope, ArtEnt600,
                Construction600, Private600, perc_Non_Hisp, Perc_Poverty,
                Median_Income, lagPrice)

M <- cor(houseprice.df3)
corrplot(M, method = "number")

(Figure: correlation matrix of SalePrice and the 23 numeric predictors)

To better illustrate the correlations between home price and the chosen variables, we plotted the following scatterplots. Note that despite people's preference for cultural, environmental, and safety amenities, internal characteristics remain the most influential category of predictors of home price.

# ArtEnt600 = number of "Arts, Entertainment, and Recreation" businesses within a 600m buffer
# tree.Buffer100 = number of trees within a 100m buffer
# Assault_1 = distance to the nearest assault crime
# PropArea = property area

# homes with an observed sale price (hold-out homes have SalePrice = 0)
holdout <- subset(houseprice.sf3, SalePrice > 0)

as.data.frame(holdout) %>%
  dplyr::select(SalePrice, tree.Buffer100, Assault_1, PropArea, ArtEnt600) %>%
  gather(Variable, Value, -SalePrice) %>%
  ggplot(aes(Value, SalePrice)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, colour = "#FA7800") +
    facet_wrap(~Variable, ncol = 2, scales = "free") +
    labs(title = "Price & 4 related variables") +
    theme(axis.text.x = element_blank()) +
    plotTheme()

Similarly, we examined the correlations between home price and the chosen variables by plotting them spatially on a base map of San Francisco. High-value houses clearly cluster in the central neighborhoods and along the north-eastern coast, in accordance with the clustering of arts and entertainment facilities. The distributions of construction businesses and assault crimes show the opposite trend in this region, which helps explain why people are reluctant to reside in certain areas.

# sold homes only (SalePrice > 0)
ggplot() + 
    geom_sf(data = sf_basemap, fill = "grey40") + 
    geom_sf(data = holdout, aes(colour = q5(SalePrice)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5, 
                        labels=qBr(holdout,"SalePrice"),
                        name="Quintile\nBreaks") +
    labs(title="Observed sale prices") +
    mapTheme()

# arts and entertainment density
ggplot() +
    geom_sf(data = sf_basemap, fill = "grey40") +
    geom_sf(data = holdout, aes(colour = q5(ArtEnt600)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5,
                        labels=qBr(holdout,"ArtEnt600"),
                        name="Quintile\nBreaks") +
    labs(title="Number of Art Entertainment within 600m buffer") +
    mapTheme()

ggplot() +
    geom_sf(data = sf_basemap, fill = "grey40") +
    geom_sf(data = holdout, aes(colour = q5(Construction600)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5,
                        labels=qBr(holdout,"Construction600"),
                        name="Quintile\nBreaks") +
    labs(title="Number of Construction Bussiness within 600m buffer") +
    mapTheme()

ggplot() +
    geom_sf(data = sf_basemap, fill = "grey40") +
    geom_sf(data = holdout, aes(colour = q5(Assault_1)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5,
                        labels=qBr(holdout,"Assault_1"),
                        name="Quintile\nBreaks") +
    labs(title="Distance to nearest Assault crime") +
    mapTheme()

3 Methods

The goal of our geospatial prediction is to borrow the experience of existing house transactions and test the extent to which that experience generalizes to predicting the values of unsold houses. As discussed above, our model is an abstraction of each home site's reality built from carefully constructed features. When estimating the model we used OLS regression, because it is more transparent and computationally more efficient than more advanced machine-learning algorithms. We then validated the accuracy and generalizability of the model by dividing Zillow's transaction dataset into a training set and a test set, thereby simulating the process of predicting unknown home prices using feature patterns learned from existing home prices. By examining the absolute value, percentage, and spatial distribution of errors in the test set, we can tell whether the model generalizes well to new data as well as to different spatial contexts.

4 Results

Since a "good" model is one that predicts accurately for both existing and new data, we randomly divided the sold homes into a training set (60%) and a test set (40%). Below are the main model results for the training set and the prediction results for the test set.

The adjusted R-squared on the training set was 0.708, while the test set adjusted R-squared reached 0.722. These results indicate that the chosen predictors explain more than 70% of the variation in home prices, which speaks to the model's explanatory power. In addition, the F-statistic has a very small p-value, so we can state with confidence that our model fits the data well and offers meaningful explanations of house price variation.

As for the accuracy and generalizability metrics, the Mean Absolute Error (MAE) on the test set is 248009 dollars, meaning that our predictions fall, on average, roughly 250000 dollars above or below the real price. Although this is not a trivial amount of error, it should be weighed against the capital-intensive nature of the real estate business. The Mean Absolute Percent Error (MAPE) is only 24.68%, meaning that our predicted house prices deviate from observed house prices by less than a quarter on average.
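For reference, letting $y_i$ denote the observed price, $\hat{y}_i$ the prediction, and $n$ the number of test-set homes, the two metrics are defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad\qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{\left|y_i - \hat{y}_i\right|}{y_i}$$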

#R^2 table, training set
holdout <- subset(houseprice.sf3, SalePrice > 0)
set.seed(12321)
inTrain <- createDataPartition(
  y = paste(holdout$GEOID, holdout$SaleYr, holdout$nhood),
  p = .60, list = FALSE)

sanf.training <- holdout[inTrain,]
sanf.test <- holdout[-inTrain,]

reg_train <- lm(SalePrice ~ tree.Buffer100 + transit600.Buffer + roadlength_200 +
                  parksdensity3 + parksdensity2 + art.Buffer800 + school_1 +
                  massage.Buffer1000 + storage.Buffer600 + Assault600.Buffer +
                  Assault_1 + PropArea + Baths + Beds + Stories + slope +
                  ArtEnt600 + Construction600 + Private600 + perc_Non_Hisp +
                  Perc_Poverty + year + Median_Income + lagPrice + SaleYr +
                  nhood + GEOID,
                data = sanf.training)


coef_table <- summary(reg_train)$coef[1:32,] %>% round(5)
coef_table
##                                  Estimate     Std. Error  t value Pr(>|t|)
## (Intercept)                -5027850.18006 27525554.46732 -0.18266  0.85507
## tree.Buffer100                 1452.76027      188.55280  7.70479  0.00000
## transit600.Buffer              -400.01380      821.15046 -0.48714  0.62618
## roadlength_200                  -58.48123       11.64654 -5.02134  0.00000
## parksdensity3                     0.12894        0.23863  0.54032  0.58900
## parksdensity2                   -19.93993        6.66470 -2.99187  0.00278
## art.Buffer800                 -1235.43272      697.06021 -1.77235  0.07639
## school_1                         57.56515       44.33101  1.29853  0.19416
## massage.Buffer1000              716.79037     3946.23974  0.18164  0.85587
## storage.Buffer600             -2343.44855     4224.75120 -0.55470  0.57912
## Assault600.Buffer              -123.89267       42.73125 -2.89935  0.00375
## Assault_1                       437.54657      167.20285  2.61686  0.00890
## PropArea                        384.68110       11.37982 33.80378  0.00000
## Baths                         -1491.43499     8289.00818 -0.17993  0.85721
## Beds                          -2094.65995     5280.62877 -0.39667  0.69163
## Stories                       79362.22392    10937.80018  7.25578  0.00000
## slope                          6063.37417     1906.37092  3.18058  0.00148
## ArtEnt600                      2429.29656      674.20599  3.60320  0.00032
## Construction600                -706.05338      526.24356 -1.34169  0.17975
## Private600                     -796.69676      320.04966 -2.48929  0.01283
## perc_Non_Hisp               -631499.09518  4583461.78237 -0.13778  0.89042
## Perc_Poverty               -1096805.76108  9520683.57215 -0.11520  0.90829
## year1890-1910                -37725.18854    47313.56383 -0.79734  0.42528
## year1910-1930                -41457.80715    47293.34868 -0.87661  0.38074
## year1930-1950                 -4971.07496    47754.85694 -0.10410  0.91710
## year1950-1970                -51002.66328    49045.51185 -1.03990  0.29843
## year1970-1990               -258074.03827    52571.16364 -4.90904  0.00000
## year1990-2010               -146312.64756    53059.18932 -2.75754  0.00584
## Median_Income                    77.40336      749.22122  0.10331  0.91772
## lagPrice                          0.22620        0.01696 13.33707  0.00000
## SaleYr                       164603.26456     4526.54111 36.36403  0.00000
## nhoodBayview Hunters Point   -45233.46070  1101419.37312 -0.04107  0.96724
#Residual standard error: 378386 on 5713 degrees of freedom
#Multiple R-squared:  0.7188498,    Adjusted R-squared:  0.7081708 
#F-statistic: 67.31383 on 217 and 5713 DF,  p-value: < 0.00000000000000022204
sanf.test <-
  sanf.test %>%
  mutate(Regression = "Baseline Regression",
         SalePrice.Predict = predict(reg_train, sanf.test),
         SalePrice.Error = SalePrice - SalePrice.Predict,
         SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
         SalePrice.APE = abs(SalePrice - SalePrice.Predict) / SalePrice) %>%
  filter(SalePrice < 6000000)

sanf.test.df <- dplyr::select(as.data.frame(sanf.test), -geometry)

reg_test <- lm(SalePrice ~ tree.Buffer100 + transit600.Buffer + roadlength_200 +
                 parksdensity3 + parksdensity2 + art.Buffer800 + school_1 +
                 massage.Buffer1000 + storage.Buffer600 + Assault600.Buffer +
                 Assault_1 + PropArea + Baths + Beds + Stories + slope +
                 ArtEnt600 + Construction600 + Private600 + perc_Non_Hisp +
                 Perc_Poverty + year + Median_Income + lagPrice + SaleYr +
                 nhood + GEOID,
               data = sanf.test)


MAE <- sanf.test.df%>%
  summarize(mean(SalePrice.AbsError, na.rm = T)) %>%
  round(3)%>%
  pull()

MAPE <- sanf.test.df%>%
  summarize(mean(SalePrice.APE, na.rm = T) * 100) %>%
  round(2) %>%
  paste("%")

x <- data.frame("MAE"=MAE,
                "MAPE"=MAPE,
                "Adjusted R^2"=0.722)
x%>%
kable() %>%
    kable_styling("striped", full_width = F)
MAE          MAPE      Adjusted.R.2
248008.699   24.68 %   0.722

To further examine whether our model over-fits the training data, we ran 100-fold cross-validation on the training set and plotted the cross-validation MAE as a histogram. A standard deviation of 34834 suggests relatively little variation across folds, given that the mean MAE is 255572 (very similar to the test set MAE). The tightly clustered distribution of errors also indicates comparable goodness-of-fit metrics across folds.

fitControl <- trainControl(method = "cv", number = 100)

set.seed(1314)

reg4.cv <- train(SalePrice ~ ., data = as.data.frame(sanf.training) %>% 
                   dplyr::select(SalePrice, tree.Buffer100, transit600.Buffer, roadlength_200, parksdensity3, parksdensity2, art.Buffer800, school_1, massage.Buffer1000, storage.Buffer600, Assault600.Buffer,  Assault_1,  PropArea, Baths, Beds, Stories,  slope, ArtEnt600, Construction600,  Private600,  perc_Non_Hisp,  Perc_Poverty,  year, Median_Income,  lagPrice, SaleYr, nhood ), 
                 method = "lm", trControl = fitControl, na.action = na.pass)

cv.mean <- mean(reg4.cv$resample[,3]) %>% round(2)  # mean MAE across the 100 folds
cv.std  <- sd(reg4.cv$resample[,3]) %>% round(2)    # spread of MAE across folds

y <- data.frame("Std" = cv.std,
                "Mean" = cv.mean,
                "R^2" = 0.70)

y%>%
kable() %>%
    kable_styling("striped", full_width = F)
Std        Mean        R.2
34834.48   255571.98   0.7
#histogram
ggplot(as.data.frame(reg4.cv$resample), aes(MAE)) + 
  geom_histogram(bins = 50, colour="white", fill = "#FA7800") +
  labs(title="Distribution of MAE", subtitle = "k-fold cross validation; k = 100",
       x="Mean Absolute Error", y="Count") +
  plotTheme()

We then looked further into our model's accuracy and generalizability by plotting predicted prices as a function of observed prices. The result indicates that, despite generally acceptable predictive power, our model does tend to under-predict the most expensive homes. This may be caused by insufficient sampling of such houses in our dataset.

ggplot(sanf.test, aes(SalePrice, SalePrice.Predict)) +
  geom_point(colour = "#FA7800") +
  geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
  geom_smooth(data = sanf.test, aes(SalePrice, SalePrice), method = "lm", se = FALSE, colour = "red") +
  labs(title = "Predicted price as a function of observed price",
       caption = "Red line represents a perfect prediction; green line represents our prediction",
       x = "Observed Sale Price",
       y = "Predicted Sale Price") +
  plotTheme()

Next, we examined our model's generalizability over space. The map of the test set's residuals (errors) shows a spatially random pattern, suggesting that the spatial variation underlying home prices has been mostly captured by our model. The result of a Moran's I test likewise shows that the errors exhibit little spatial autocorrelation.

ggplot() +
    geom_sf(data = sf_basemap, fill = "grey40") +
    geom_sf(data = sanf.test, aes(colour = q5(SalePrice.Error)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5,
                        labels=qBr(sanf.test,"SalePrice.Error"),
                        name="Quintile\nBreaks") +
    labs(title="Residuals of the test set") +
    mapTheme()

#moran's I
coords.test <- 
  filter(sanf.test, !is.na(SalePrice.Error)) %>% 
  st_coordinates() 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

moranTest <- moran.mc(filter(sanf.test, !is.na(SalePrice.Error))$SalePrice.Error, 
                      spatialWeights.test , nsim = 999)

ggplot(data.frame(res = moranTest$res[1:999]), aes(res)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in red",
       x="Moran's I",
       y="Count",
       caption="Public Policy Analytics, Figure 6.8") +
  plotTheme()

Having confirmed the accuracy and generalizability of our model, we provide a map of our final predicted values for the entire dataset (both sold and hold-out homes). This predicted home value map shows spatial patterns similar to the observed home value map in the Data section, indicating the model's good predictive power.

reg_holdout <- lm(SalePrice ~ tree.Buffer100 + transit600.Buffer + roadlength_200 +
                    parksdensity3 + parksdensity2 + art.Buffer800 + school_1 +
                    massage.Buffer1000 + storage.Buffer600 + Assault600.Buffer +
                    Assault_1 + PropArea + Baths + Beds + Stories + slope +
                    ArtEnt600 + Construction600 + Private600 + perc_Non_Hisp +
                    Perc_Poverty + year + Median_Income + lagPrice + SaleYr +
                    nhood + GEOID,
                  data = holdout)


predict_all <-
  houseprice.sf3 %>%
  mutate(
    SalePrice.Predict = predict(reg_holdout, houseprice.sf3),
         SalePrice.Error = SalePrice - SalePrice.Predict,
         SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
         SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice)%>%
  filter(SalePrice < 6000000)



ggplot() + 
    geom_sf(data = sf_basemap, fill = "grey40") + 
    geom_sf(data = predict_all, aes(colour = q5(SalePrice.Predict)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5, 
                        labels=qBr(predict_all,"SalePrice.Predict"),
                        name="Quintile\nBreaks") +
    labs(title="All Predict Sale Prices") +
    mapTheme()

Last but not least, we checked the model's generalizability across different urban contexts. Using the test set predictions, we mapped mean absolute percentage error (MAPE) by neighborhood and by census tract, and plotted MAPE by neighborhood (or tract) as a function of mean price by neighborhood (or tract). The tract-effects model's accuracy is clearly more consistent across space than the neighborhood-effects model, which suggests the former is more generalizable. This is why we included tract ID, rather than neighborhood name, as the spatial fixed-effect indicator in our final model.

sanf.test.nhoods <-
  sanf.test %>%
    group_by(nhood) %>%
    summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T),
              mean.SalePrice = mean(SalePrice, na.rm = T)) %>%
  st_set_geometry(NULL)%>%
  left_join(neighourhood)%>%
  st_sf()


#by neighbourhood 
ggplot() + 
    geom_sf(data = sanf.test.nhoods, aes(fill = q5(mean.MAPE))) + 
    geom_sf(data = sanf.test, colour="black", size = .3) +
    scale_fill_manual(values = palette5, 
                        labels=qBr(sanf.test.nhoods,"mean.MAPE",rnd=F),
                        name="Quintile\nBreaks") +
    labs(title="Mean test set MAPE by neighbourhood") +
    mapTheme()

#by tract 
tract <- 
  get_acs(geography = "tract", variables = c("B01003_001"), 
          year = 2015, state=06, county=075, geometry=T) %>%
  st_transform(102286) %>%
  dplyr::select(GEOID,geometry)
sanf.test.tract <-
  sanf.test %>%
    group_by(GEOID) %>%
    summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T),
              mean.SalePrice = mean(SalePrice, na.rm = T)) %>%
  st_set_geometry(NULL)%>%
  left_join(tract)%>%
  st_sf()

ggplot() + 
    geom_sf(data = sanf.test.tract, aes(fill = q5(mean.MAPE))) + 
    geom_sf(data = sanf.test, colour="black", size = .3) +
    scale_fill_manual(values = palette5, 
                        labels=qBr(sanf.test.tract,"mean.MAPE",rnd=F),
                        name="Quintile\nBreaks") +
    labs(title="Mean test set MAPE by Tract") +
    mapTheme()

# by tract
ggplot(sanf.test.tract, aes( mean.SalePrice, mean.MAPE)) +
  geom_point(colour = "#FA7800") +
  geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
  labs(title = "MAPE by tract as a function of Mean Price by tract",
       x = "mean.SalePrice",
       y = "mean.MAPE") +
  plotTheme()

#by neighbourhood 
ggplot(sanf.test.nhoods, aes( mean.SalePrice, mean.MAPE)) +
  geom_point(colour = "#FA7800") +
  geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
  labs(title = "MAPE by tract as a function of Mean Price by neighbourhood",
       x = "mean.SalePrice",
       y = "mean.MAPE") +
  plotTheme()

We also used census data to split San Francisco into two groups (by race and by income) to test our model's generalizability. As the following tables indicate, our model generalizes well across income contexts, but performs somewhat less consistently across racial contexts. However, since the gap between groups is relatively small (less than 5 percentage points of MAPE), the model still offers a good reference for commercial trades and policy making.

#by racial 
sanf.test %>% 
  group_by(raceContext) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  st_set_geometry(NULL) %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Mean Absolute Error of test set by racial context") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(1, color = "black", background = "#25CB10") 
Mean Absolute Error of test set by racial context
Majority Non-White Majority White
0.2275922545 0.2700526666
#by income
sanf.test %>% 
  group_by(incomeContext) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  st_set_geometry(NULL) %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Mean Absolute Error of test set by racial context") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(1, color = "black", background = "#25CB10") 
Mean Absolute Error of test set by racial context
High Income Low Income
0.2461552914 0.246963671

5 Discussion and Conclusion

Given the above examinations and validations, our model is effective and generalizable for predicting home values in San Francisco. Overall, it explains and predicts more than 70% of the variation in house sale prices. The maps comparing predicted and observed house prices also indicate that the model offers convincing explanations and predictions of the spatial variation in home values. Judging by the coefficients and t-values, the most important predictors in our model are the number of trees (within a 100m buffer), arts and entertainment facilities (within a 600m buffer), stories, built year, sale year, slope, and tract ID (the tract fixed effect), each representing a different aspect of internal characteristics, amenities, or spatial structure. Despite all efforts to balance accuracy and generalizability, our predictions still deviate from the real price by 248009 dollars on average, which is less than 25% of the final sale price. Spatially, the model generalizes well across space and urban contexts, especially for the majority of sales at moderate transaction prices. However, its predictive power is less satisfying for the cheapest and most expensive homes. This shortcoming is most likely caused by a lack of representative samples in the training data, and could be improved by expanding the sample size and optimizing the sample structure.

To sum up, we strongly recommend that Zillow adopt our methodology to improve the accuracy and generalizability of house price prediction in San Francisco, thereby facilitating related business and administrative processes such as taxation. Although the model could be further improved by introducing more detailed, high-quality variables on internal characteristics and by adopting more powerful predictive algorithms, it has proven itself a sufficiently predictive and generalizable home price prediction tool, and provides a good starting point for future enhancement and generalization.