Predicting House Sale Prices

In this example, we aim to predict the sale price of houses in King County, Washington, using a dataset available on Kaggle. We start with Optimal Regression Trees with constant predictions in each leaf. The results give us evidence to suspect strong linear effects that could be modeled more directly, so we use Optimal Sparse Regression to detect the key linear predictor and then fit Optimal Regression Trees with linear predictions using this variable. We show that this careful modeling leads to significant gains in both performance and interpretability.

First, we load the data and create the features and labels:

using CSV, DataFrames, CategoricalArrays
data = DataFrame(CSV.File("kc_house_data.csv"))

categorical = [
    :waterfront,
]
ordinal = [
    :condition,
    :view,
    :grade,
]
continuous = [
    :floors,
    :bedrooms,
    :bathrooms,
    :sqft_above,
    :sqft_basement,
    :yr_built,
    :yr_renovated,
    :sqft_living,
    :sqft_lot,
    :lat,
    :long,
    :sqft_living15,
    :sqft_lot15,
]
X = data[:, [categorical; ordinal; continuous]]
y = data[:, :price]
for var in categorical
  X[!, var] = CategoricalArray(X[!, var])
end
for var in ordinal
  X[!, var] = CategoricalArray(X[!, var], ordered=true)
  levels!(X[!, var], sort(levels(X[!, var])))
end
X
21613×17 DataFrame. Omitted printing of 11 columns
│ Row   │ waterfront │ condition │ view │ grade │ floors  │ bedrooms │
│       │ Cat…       │ Cat…      │ Cat… │ Cat…  │ Float64 │ Int64    │
├───────┼────────────┼───────────┼──────┼───────┼─────────┼──────────┤
│ 1     │ 0          │ 3         │ 0    │ 7     │ 1.0     │ 3        │
│ 2     │ 0          │ 3         │ 0    │ 7     │ 2.0     │ 3        │
│ 3     │ 0          │ 3         │ 0    │ 6     │ 1.0     │ 2        │
│ 4     │ 0          │ 5         │ 0    │ 7     │ 1.0     │ 4        │
│ 5     │ 0          │ 3         │ 0    │ 8     │ 1.0     │ 3        │
│ 6     │ 0          │ 3         │ 0    │ 11    │ 1.0     │ 4        │
│ 7     │ 0          │ 3         │ 0    │ 7     │ 2.0     │ 3        │
⋮
│ 21606 │ 0          │ 3         │ 0    │ 9     │ 2.0     │ 4        │
│ 21607 │ 0          │ 3         │ 0    │ 9     │ 2.0     │ 4        │
│ 21608 │ 0          │ 3         │ 0    │ 8     │ 2.0     │ 3        │
│ 21609 │ 0          │ 3         │ 0    │ 8     │ 3.0     │ 3        │
│ 21610 │ 0          │ 3         │ 0    │ 8     │ 2.0     │ 4        │
│ 21611 │ 0          │ 3         │ 0    │ 7     │ 2.0     │ 2        │
│ 21612 │ 0          │ 3         │ 0    │ 8     │ 2.0     │ 3        │
│ 21613 │ 0          │ 3         │ 0    │ 7     │ 2.0     │ 2        │
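
As a quick sanity check on this encoding (not part of the original example), we can confirm that the ordinal columns were marked as ordered and that their levels are in ascending order. The functions below come from CategoricalArrays (loaded above) and Base:

# Should both return true: :grade is ordered and its levels ascend from worst to best
isordered(X[!, :grade])
issorted(levels(X[!, :grade]))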

We split the data into training and testing sets:

seed = 1
(X_train, y_train), (X_test, y_test) =
    IAI.split_data(:regression, X, y, seed=seed, train_proportion=0.5)
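
As a quick check (not part of the original example), we can verify that train_proportion=0.5 has split the 21,613 observations into two roughly equal halves:

# Number of observations in each split
length(y_train), length(y_test)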

As a first model, we fit a simple regression tree with constant predictions to gain some understanding of how the various features interact:

grid_constant_tree = IAI.GridSearch(
    IAI.OptimalTreeRegressor(random_seed=seed, minbucket=50),
    max_depth=1:5,
)
IAI.fit!(grid_constant_tree, X_train, y_train)
[Interactive Optimal Trees visualization of the fitted tree]
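
To see how well this constant-prediction tree explains the prices before refining the model, we can evaluate it in and out of sample. This is a minimal sketch assuming the standard IAI.score interface; with criterion=:mse it reports a quality score where 1.0 indicates a perfect fit (akin to R²):

# Quality of the tuned tree on the training and test sets
IAI.score(grid_constant_tree, X_train, y_train, criterion=:mse)
IAI.score(grid_constant_tree, X_test, y_test, criterion=:mse)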