Quick Start Guide: Heuristic Regressors

In this example we will use the regressors from Heuristics on the yacht hydrodynamics dataset. First, we load the data and split it into training and test datasets:

using CSV, DataFrames
df = DataFrame(CSV.File(
    "yacht_hydrodynamics.data",
    delim=' ',            # file uses ' ' as separators rather than ','
    ignorerepeated=true,  # sometimes columns are separated by more than one ' '
    header=[:position, :prismatic, :length_displacement, :beam_draught,
            :length_beam, :froude, :resistance],
))
308×7 DataFrame
 Row │ position  prismatic  length_displacement  beam_draught  length_beam  fr ⋯
     │ Float64   Float64    Float64              Float64       Float64      Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     -2.3      0.568                 4.78          3.99         3.17     ⋯
   2 │     -2.3      0.568                 4.78          3.99         3.17
   3 │     -2.3      0.568                 4.78          3.99         3.17
   4 │     -2.3      0.568                 4.78          3.99         3.17
   5 │     -2.3      0.568                 4.78          3.99         3.17     ⋯
   6 │     -2.3      0.568                 4.78          3.99         3.17
   7 │     -2.3      0.568                 4.78          3.99         3.17
   8 │     -2.3      0.568                 4.78          3.99         3.17
  ⋮  │    ⋮          ⋮               ⋮                ⋮             ⋮          ⋱
 302 │     -2.3      0.6                   4.34          4.23         2.73     ⋯
 303 │     -2.3      0.6                   4.34          4.23         2.73
 304 │     -2.3      0.6                   4.34          4.23         2.73
 305 │     -2.3      0.6                   4.34          4.23         2.73
 306 │     -2.3      0.6                   4.34          4.23         2.73     ⋯
 307 │     -2.3      0.6                   4.34          4.23         2.73
 308 │     -2.3      0.6                   4.34          4.23         2.73
                                                  2 columns and 293 rows omitted
X = df[:, 1:(end - 1)]
y = df[:, end]
(train_X, train_y), (test_X, test_y) = IAI.split_data(:regression, X, y, seed=1)
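
By default, split_data reserves 30% of the data for testing, so we expect 216 training and 92 test samples. A quick sanity check (assuming the default train_proportion of 0.7):

nrow(train_X), nrow(test_X)
(216, 92)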

Random Forest Regressor

We will use a GridSearch to fit a RandomForestRegressor with some basic parameter validation:

grid = IAI.GridSearch(
    IAI.RandomForestRegressor(
        random_seed=1,
    ),
    max_depth=5:10,
)
IAI.fit!(grid, train_X, train_y)
All Grid Results:

 Row │ max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Float64      Float64      Int64
─────┼───────────────────────────────────────────────────────
   1 │         5     0.994449     0.990294                 6
   2 │         6     0.994511     0.990322                 1
   3 │         7     0.994515     0.990321                 2
   4 │         8     0.994515     0.990321                 3
   5 │         9     0.994515     0.990321                 4
   6 │        10     0.994515     0.990321                 5

Best Params:
  max_depth => 6

Best Model - Fitted RandomForestRegressor
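
The chosen parameters and the refitted learner can also be retrieved programmatically. A minimal sketch, assuming get_best_params is available alongside get_learner in your IAI version:

best_params = IAI.get_best_params(grid)  # Dict(:max_depth => 6)
best_lnr = IAI.get_learner(grid)         # the refitted RandomForestRegressor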

We can make predictions on new data using predict:

IAI.predict(grid, test_X)
92-element Vector{Float64}:
  0.12836191255
  0.24591597504
  1.293050230222
  2.854614345067
  5.201156146354
 13.586306926407
 20.989118226218
  0.12756191255
  0.24591597504
  0.497498593494
  ⋮
  2.872119106971
  8.033722622378
 13.586306926407
  1.280463765576
  1.979024639249
  2.896419464114
 13.862548174881
 33.505260589966
 53.294637063492

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the $R^2$ on the training set:

IAI.score(grid, train_X, train_y, criterion=:mse)
0.9954616758740213

Or on the test set:

IAI.score(grid, test_X, test_y, criterion=:mse)
0.9898501909798273
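
With criterion=:mse the score is reported as $R^2$, i.e. one minus the ratio of the model's squared error to that of a baseline predicting the mean. We can reproduce the test score by hand; a minimal sketch using the Statistics standard library (results may differ slightly depending on how the baseline mean is chosen):

using Statistics
pred = IAI.predict(grid, test_X)
1 - sum((test_y .- pred) .^ 2) / sum((test_y .- mean(test_y)) .^ 2)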

We can also look at the variable importance:

IAI.variable_importance(IAI.get_learner(grid))
6×2 DataFrame
 Row │ Feature              Importance
     │ Symbol               Float64
─────┼──────────────────────────────────
   1 │ froude               0.994405
   2 │ prismatic            0.00165704
   3 │ length_displacement  0.00161713
   4 │ beam_draught         0.00142863
   5 │ length_beam          0.000663934
   6 │ position             0.000227865
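
The importance DataFrame can be passed directly to a plotting package. For example, a quick bar chart with Plots.jl (a sketch; Plots is an assumption, not a dependency of this guide):

using Plots
imp = IAI.variable_importance(IAI.get_learner(grid))
bar(string.(imp.Feature), imp.Importance,
    legend=false, xrotation=45, ylabel="Importance")

As the table suggests, froude dominates the resulting chart.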

XGBoost Regressor

We will use a GridSearch to fit an XGBoostRegressor with some basic parameter validation:

grid = IAI.GridSearch(
    IAI.XGBoostRegressor(
        random_seed=1,
    ),
    max_depth=2:5,
    num_round=[20, 50, 100],
)
IAI.fit!(grid, train_X, train_y)
All Grid Results:

 Row │ num_round  max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Int64      Float64      Float64      Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │        20          2     0.99756      0.993216                 3
   2 │        20          3     0.999231     0.99266                  9
   3 │        20          4     0.999735     0.993189                 4
   4 │        20          5     0.999831     0.99201                 10
   5 │        50          2     0.999287     0.995097                 2
   6 │        50          3     0.999892     0.993184                 5
   7 │        50          4     0.999959     0.992946                 7
   8 │        50          5     0.999973     0.990473                11
   9 │       100          2     0.999698     0.996021                 1
  10 │       100          3     0.9999       0.993153                 6
  11 │       100          4     0.999959     0.992946                 8
  12 │       100          5     0.999973     0.990472                12

Best Params:
  num_round => 100
  max_depth => 2

Best Model - Fitted XGBoostRegressor

We can make predictions on new data using predict:

IAI.predict(grid, test_X)
92-element Vector{Float64}:
  0.23346585
  0.37805766
  1.26587629
  2.81103897
  5.41519117
 12.59898853
 20.52502441
  0.21097377
  0.25714141
  0.42428526
  ⋮
  2.81908727
  7.73881865
 12.53344631
  1.45720315
  2.09043741
  3.00236559
 11.98634815
 33.26702881
 49.76499939

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the $R^2$ on the training set:

IAI.score(grid, train_X, train_y, criterion=:mse)
0.99950684

Or on the test set:

IAI.score(grid, test_X, test_y, criterion=:mse)
0.99734513

We can also look at the variable importance:

IAI.variable_importance(IAI.get_learner(grid))
6×2 DataFrame
 Row │ Feature              Importance
     │ Symbol               Float64
─────┼──────────────────────────────────
   1 │ froude               0.993665
   2 │ prismatic            0.00254683
   3 │ beam_draught         0.00137171
   4 │ length_displacement  0.00115397
   5 │ position             0.000648411
   6 │ length_beam          0.000614171

We can calculate the SHAP values:

s = IAI.predict_shap(grid, test_X)
s[:shap_values]
92×6 Matrix{Float64}:
-0.0632808  -0.0114608    -0.0391644   -0.0184659    -0.0815269    -9.86757
-0.0664552  -0.0114608    -0.0378506   -0.0076839    -0.0815269    -9.7319
-0.0723957  -0.012742     -0.0430042    0.00760984   -0.134582     -8.79394
-0.0723957  -0.00525591   -0.0430042    0.0375999    -0.134582     -7.28626
-0.175767    0.000270572  -0.0390034    0.0418154    -0.17572      -4.55134
-0.214559   -0.196073     -0.0348593    0.000398656  -0.160192      2.88934
-0.205349   -0.493743     -0.0348593   -0.0975699    -0.204386     11.246
-0.0632808  -0.0534274    -0.0380514    0.0455295     0.026066    -10.0208
-0.0664552  -0.0534274    -0.0367376    0.0070994     0.026066     -9.93434
-0.0723957  -0.0534274    -0.0367376    0.0070994     0.00867438   -9.74386
⋮                                                                  ⋮
-0.0767186  -0.128861      0.00186686   0.0375999    -0.11716      -7.21257
-0.18009    -0.177774      0.00586762   0.0418154    -0.158297     -2.10764
-0.218882   -0.342035     -0.00442569   0.000398656  -0.142769      2.92623
-0.0809773  -0.140015      0.0713933    0.00760984    0.13303      -8.84877
-0.0809773  -0.138004      0.0713933    0.0184519     0.13303      -8.22839
-0.0809773  -0.132529      0.0713933    0.0375999     0.13303      -7.34109
-0.223141   -0.345702      0.0795381    0.000398656  -0.272206      2.43253
-0.147518   -0.643373      0.5539      -0.30065      -0.273265     23.763
-0.17315    -1.51305       0.799156    -0.285291      0.132692     40.4897
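
predict_shap also returns the expected value of the model output, and by the additivity property of SHAP values, each row plus this baseline should recover the corresponding prediction. A quick sanity check (assuming the baseline is stored under :expected_value, following the SHAP convention):

# each prediction = expected value + sum of that sample's SHAP values
IAI.predict(grid, test_X) ≈ s[:expected_value] .+ vec(sum(s[:shap_values], dims=2))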

We can then use the SHAP library in Python to visualize these results in whichever way we prefer. For example, the following code creates a summary plot:

using PyCall
shap = pyimport("shap")
shap.summary_plot(s[:shap_values], Matrix(s[:features]), names(s[:features]))
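
Once we are satisfied with the model, the fitted grid can be saved to disk and reloaded later. A minimal sketch, assuming your IAI version provides the JSON serialization functions write_json and read_json:

IAI.write_json("xgb_grid.json", grid)  # save the fitted grid
grid = IAI.read_json("xgb_grid.json")  # reload it in a later session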

GLMNet Regressor

We can use a GLMNetCVRegressor to fit a GLMNet model using cross-validation:

lnr = IAI.GLMNetCVRegressor(
    random_seed=1,
    n_folds=10,
)
IAI.fit!(lnr, train_X, train_y)
Fitted GLMNetCVRegressor:
  Constant: -22.2638
  Weights:
    froude:  113.914
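
The learner fits models along a path of regularization values λ and selects the one with the best cross-validated performance; here only froude receives a nonzero coefficient at the chosen λ. If your IAI version exposes get_num_fits, the length of the fitted path can be inspected (a hypothetical usage sketch):

# number of λ values fitted along the regularization path (assumed API)
IAI.get_num_fits(lnr)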

We can access the coefficients from the fitted model with get_prediction_weights and get_prediction_constant:

numeric_weights, categoric_weights = IAI.get_prediction_weights(lnr)
numeric_weights
Dict{Symbol, Float64} with 1 entry:
  :froude => 113.914
categoric_weights
Dict{Symbol, Dict{Any, Float64}}()
IAI.get_prediction_constant(lnr)
-22.263799
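
Because the model is linear, any prediction can be reconstructed by hand from these coefficients. For the first test sample (a sketch; froude is the only nonzero weight):

# should match IAI.predict(lnr, test_X)[1] == -8.024522
IAI.get_prediction_constant(lnr) + numeric_weights[:froude] * test_X[1, :froude]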

We can make predictions on new data using predict:

IAI.predict(lnr, test_X)
92-element Vector{Float64}:
 -8.024522
 -5.176667
  3.366899
  9.06261
 14.758321
 20.454031
 23.301887
 -8.024522
 -5.176667
 -2.328811
  ⋮
  9.06261
 17.606176
 20.454031
  3.366899
  6.214755
  9.06261
 20.454031
 26.149742
 28.997597

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the $R^2$ on the training set:

IAI.score(lnr, train_X, train_y, criterion=:mse)
0.654571718994

Or on the test set:

IAI.score(lnr, test_X, test_y, criterion=:mse)
0.651006827713

We can also look at the variable importance:

IAI.variable_importance(lnr)
6×2 DataFrame
 Row │ Feature              Importance
     │ Symbol               Float64
─────┼─────────────────────────────────
   1 │ froude                      1.0
   2 │ beam_draught                0.0
   3 │ length_beam                 0.0
   4 │ length_displacement         0.0
   5 │ position                    0.0
   6 │ prismatic                   0.0