Predicting House Sale Prices

In this example, we aim to predict the sale price of houses in King County, Washington, using a dataset available on Kaggle. We will start with Optimal Regression Trees with constant predictions. The results give us reason to suspect that there are strong linear effects that could be modeled more directly, so we use Optimal Sparse Regression to identify the key linear predictor and then fit Optimal Regression Trees with linear predictions using this variable. We show that this careful modeling leads to significant gains in both performance and interpretability.

Data Preparation

First, we load the data and create the features and labels:

using CSV, DataFrames
data = CSV.read("kc_house_data.csv", DataFrame)
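Before constructing the features, it can be helpful to glance at the raw table, for example its dimensions and column names (an optional inspection step, not part of the modeling pipeline):

size(data)
names(data)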

using CategoricalArrays
categoric = [
    :waterfront,
]
ordinal = [
    :condition,
    :view,
    :grade,
]
continuous = [
    :floors,
    :bedrooms,
    :bathrooms,
    :sqft_above,
    :sqft_basement,
    :yr_built,
    :yr_renovated,
    :sqft_living,
    :sqft_lot,
    :lat,
    :long,
    :sqft_living15,
    :sqft_lot15,
]
X = select(data,
    categoric .=> categorical,                       # nominal features: unordered categorical
    ordinal .=> c -> categorical(c, ordered=true),   # ordinal features: ordered categorical
    continuous,                                      # numeric features passed through unchanged
    renamecols=false,                                # keep the original column names
)

y = data.price
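House prices are often heavily right-skewed, so it can be worth looking at a few summary statistics of the target before modeling (a quick optional check using the Statistics standard library):

using Statistics
extrema(y), mean(y), median(y)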

X
21613×17 DataFrame
   Row │ waterfront  condition  view  grade  floors   bedrooms  bathrooms  sqf ⋯
       │ Cat…        Cat…       Cat…  Cat…   Float64  Int64     Float64    Int ⋯
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 0           3          0     7          1.0         3       1.0       ⋯
     2 │ 0           3          0     7          2.0         3       2.25
     3 │ 0           3          0     6          1.0         2       1.0
     4 │ 0           5          0     7          1.0         4       3.0
     5 │ 0           3          0     8          1.0         3       2.0       ⋯
     6 │ 0           3          0     11         1.0         4       4.5
     7 │ 0           3          0     7          2.0         3       2.25
     8 │ 0           3          0     7          1.0         3       1.5
   ⋮   │     ⋮           ⋮       ⋮      ⋮       ⋮        ⋮          ⋮          ⋱
 21607 │ 0           3          0     9          2.0         4       3.5       ⋯
 21608 │ 0           3          0     8          2.0         3       2.5
 21609 │ 0           3          0     8          3.0         3       2.5
 21610 │ 0           3          0     8          2.0         4       2.5
 21611 │ 0           3          0     7          2.0         2       0.75      ⋯
 21612 │ 0           3          0     8          2.0         3       2.5
 21613 │ 0           3          0     7          2.0         2       0.75
                                               10 columns and 21598 rows omitted
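If you want to confirm that the encoding was applied as intended, you can check that the ordinal features were converted to ordered categoricals and the nominal features to unordered categoricals (an optional check using CategoricalArrays, which we loaded above):

isordered(X.grade)       # true: ordinal features carry an ordering
isordered(X.waterfront)  # false: nominal features are unordered
levels(X.condition)      # the ordered set of condition levels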

We split the data into training and testing:

seed = 123
(X_train, y_train), (X_test, y_test) =
    IAI.split_data(:regression, X, y, seed=seed, train_proportion=0.5)
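As a quick sanity check, we can confirm that roughly half of the 21,613 rows ended up in each split (a minimal check, assuming the split preserves the DataFrame type so that nrow from DataFrames applies):

nrow(X_train), nrow(X_test)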

Regression Tree with Constant Predictions

First, we will try a simple regression tree with constant predictions to gain some understanding of how the various features interact:

grid_constant_tree = IAI.GridSearch(
    IAI.OptimalTreeRegressor(random_seed=seed, minbucket=50),  # require at least 50 samples per leaf
    max_depth=1:5,  # validate over tree depths from 1 to 5
)
IAI.fit!(grid_constant_tree, X_train, y_train)
(Optimal Trees visualization of the fitted regression tree)
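Beyond the visualization, one might also inspect which depth the grid search selected and how the tuned tree performs out of sample; a minimal sketch, assuming the standard IAI grid search interface (get_best_params and score):

# Parameters (here, the depth) chosen by the internal validation procedure
IAI.get_best_params(grid_constant_tree)

# Out-of-sample performance of the tuned tree (higher is better)
IAI.score(grid_constant_tree, X_test, y_test, criterion=:mse)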