Quick Start Guide: Heuristic Regressors

In this example, we will use regressors from Heuristics on the yacht hydrodynamics dataset. First, we load the data and split it into training and test datasets:

using CSV, DataFrames
df = DataFrame(CSV.File(
    "yacht_hydrodynamics.data",
    delim=' ',            # file uses ' ' as separators rather than ','
    ignorerepeated=true,  # sometimes columns are separated by more than one ' '
    header=[:position, :prismatic, :length_displacement, :beam_draught,
            :length_beam, :froude, :resistance],
))
308×7 DataFrame
 Row │ position  prismatic  length_displacement  beam_draught  length_beam  fr ⋯
     │ Float64   Float64    Float64              Float64       Float64      Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     -2.3      0.568                 4.78          3.99         3.17     ⋯
   2 │     -2.3      0.568                 4.78          3.99         3.17
   3 │     -2.3      0.568                 4.78          3.99         3.17
   4 │     -2.3      0.568                 4.78          3.99         3.17
   5 │     -2.3      0.568                 4.78          3.99         3.17     ⋯
   6 │     -2.3      0.568                 4.78          3.99         3.17
   7 │     -2.3      0.568                 4.78          3.99         3.17
   8 │     -2.3      0.568                 4.78          3.99         3.17
  ⋮  │    ⋮          ⋮               ⋮                ⋮             ⋮          ⋱
 302 │     -2.3      0.6                   4.34          4.23         2.73     ⋯
 303 │     -2.3      0.6                   4.34          4.23         2.73
 304 │     -2.3      0.6                   4.34          4.23         2.73
 305 │     -2.3      0.6                   4.34          4.23         2.73
 306 │     -2.3      0.6                   4.34          4.23         2.73     ⋯
 307 │     -2.3      0.6                   4.34          4.23         2.73
 308 │     -2.3      0.6                   4.34          4.23         2.73
                                                  2 columns and 293 rows omitted
X = df[:, 1:(end - 1)]
y = df[:, end]
(train_X, train_y), (test_X, test_y) = IAI.split_data(:regression, X, y, seed=1)
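
As a quick sanity check, we can inspect the split sizes. Assuming the default 70/30 train/test proportion of split_data, the 308 rows should divide into roughly 216 training and 92 test samples (the 92 matches the length of the prediction vectors below):

# Quick look at the split sizes (nrow comes from DataFrames)
nrow(train_X), nrow(test_X)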

Random Forest Regressor

We will use a GridSearch to fit a RandomForestRegressor with some basic parameter validation:

grid = IAI.GridSearch(
    IAI.RandomForestRegressor(
        random_seed=1,
    ),
    max_depth=5:10,
)
IAI.fit!(grid, train_X, train_y)
All Grid Results:

 Row │ max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Float64      Float64      Int64
─────┼───────────────────────────────────────────────────────
   1 │         5     0.998792     0.995212                 6
   2 │         6     0.999109     0.995511                 5
   3 │         7     0.999189     0.995521                 2
   4 │         8     0.999205     0.995522                 1
   5 │         9     0.999207     0.995519                 3
   6 │        10     0.999207     0.995519                 4

Best Params:
  max_depth => 8

Best Model - Fitted RandomForestRegressor
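
The chosen parameters and the refitted best learner can also be extracted from the grid search programmatically; a minimal sketch using the standard GridSearch accessors:

IAI.get_best_params(grid)         # the winning parameters, here max_depth => 8
best_lnr = IAI.get_learner(grid)  # the fitted RandomForestRegressor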

We can make predictions on new data using predict:

IAI.predict(grid, test_X)
92-element Array{Float64,1}:
  0.096919986981
  0.283672772148
  1.2729423224
  2.773836659452
  5.083565
 12.912132857143
 21.031601666667
  0.097826653648
  0.281992851513
  0.49682970807
  ⋮
  2.874457215007
  7.892077467532
 12.821452857143
  1.272315543901
  1.934863766511
  2.995196976912
 12.93131952381
 33.007085
 50.495606666667

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the $R^2$ on the training set:

IAI.score(grid, train_X, train_y, criterion=:mse)
0.9993065912313013

Or on the test set:

IAI.score(grid, test_X, test_y, criterion=:mse)
0.9937779440350902
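
Note that with criterion=:mse the score is reported on the $R^2$ scale (higher is better), so we can reproduce the test-set value by hand; a sketch using Statistics (the exact baseline IAI uses may differ slightly):

using Statistics
pred = IAI.predict(grid, test_X)
# R^2 = 1 - SS_res / SS_tot; should closely match the score above
1 - sum((test_y .- pred) .^ 2) / sum((test_y .- mean(test_y)) .^ 2)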

We can also look at the variable importance:

IAI.variable_importance(IAI.get_learner(grid))
6×2 DataFrame
 Row │ Feature              Importance
     │ Symbol               Float64
─────┼──────────────────────────────────
   1 │ froude               0.990682
   2 │ prismatic            0.00404472
   3 │ beam_draught         0.00243067
   4 │ position             0.0014151
   5 │ length_displacement  0.00122642
   6 │ length_beam          0.000201347

XGBoost Regressor

We will use a GridSearch to fit an XGBoostRegressor with some basic parameter validation:

grid = IAI.GridSearch(
    IAI.XGBoostRegressor(
        random_seed=1,
    ),
    max_depth=2:5,
    num_round=[20, 50, 100],
)
IAI.fit!(grid, train_X, train_y)
All Grid Results:

 Row │ num_round  max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Int64      Float64      Float64      Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │        20          2     0.997817     0.995225                 6
   2 │        20          3     0.999371     0.9953                   5
   3 │        20          4     0.999748     0.992537                 7
   4 │        20          5     0.999816     0.992295                 8
   5 │        50          2     0.999118     0.99551                  4
   6 │        50          3     0.999904     0.995645                 3
   7 │        50          4     0.999953     0.991627                 9
   8 │        50          5     0.999966     0.990115                11
   9 │       100          2     0.999632     0.9962                   1
  10 │       100          3     0.999904     0.995646                 2
  11 │       100          4     0.999953     0.991627                10
  12 │       100          5     0.999966     0.990115                12

Best Params:
  num_round => 100
  max_depth => 2

Best Model - Fitted XGBoostRegressor

We can make predictions on new data using predict:

IAI.predict(grid, test_X)
92-element Array{Float64,1}:
  0.23346328735351562
  0.3780546188354492
  1.265873908996582
  2.8110404014587402
  5.415190696716309
 12.598990440368652
 20.525022506713867
  0.2109689712524414
  0.25713634490966797
  0.42427921295166016
  ⋮
  2.819088935852051
  7.738818168640137
 12.533449172973633
  1.4572010040283203
  2.0904359817504883
  3.0023674964904785
 11.986350059509277
 33.26702117919922
 49.76500701904297

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the $R^2$ on the training set:

IAI.score(grid, train_X, train_y, criterion=:mse)
0.9995068453374839

Or on the test set:

IAI.score(grid, test_X, test_y, criterion=:mse)
0.9973451304965533

We can also look at the variable importance:

IAI.variable_importance(IAI.get_learner(grid))
6×2 DataFrame
 Row │ Feature              Importance
     │ Symbol               Float64
─────┼──────────────────────────────────
   1 │ froude               0.993665
   2 │ prismatic            0.00254683
   3 │ beam_draught         0.00137171
   4 │ length_displacement  0.00115397
   5 │ position             0.000648411
   6 │ length_beam          0.000614171
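
Since this is an XGBoost model, we may also be able to explain individual predictions with SHAP values; a sketch, assuming your IAI version provides predict_shap for XGBoost learners:

# Per-sample feature attributions for the test set (assumes predict_shap
# is available for XGBoost learners in your IAI version)
IAI.predict_shap(grid, test_X)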

GLMNet Regressor

We can use a GLMNetCVRegressor to fit a GLMNet model using cross-validation:

lnr = IAI.GLMNetCVRegressor(
    random_seed=1,
    nfolds=10,
)
IAI.fit!(lnr, train_X, train_y)
Fitted GLMNetCVRegressor:
  Constant: -22.0757
  Weights:
    froude:  113.256

We can access the coefficients from the fitted model with get_prediction_weights and get_prediction_constant:

numeric_weights, categoric_weights = IAI.get_prediction_weights(lnr)
numeric_weights
Dict{Symbol,Float64} with 1 entry:
  :froude => 113.256
categoric_weights
Dict{Symbol,Dict{Any,Float64}}()
IAI.get_prediction_constant(lnr)
-22.07569550760808
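
Because the fitted model is linear, each prediction is simply the constant plus the weighted sum of the features. Only froude has a nonzero weight here, so for a test row with froude equal to 0.125 (an illustrative value from this dataset) we can reproduce the first prediction below by hand:

# prediction = constant + weight * froude; matches the first entry of
# predict below, up to the rounding of the displayed weight
IAI.get_prediction_constant(lnr) + numeric_weights[:froude] * 0.125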

We can make predictions on new data using predict:

IAI.predict(lnr, test_X)
92-element Array{Float64,1}:
 -7.9186331246794115
 -5.087220648093677
  3.4070167816635255
  9.069841734834995
 14.73266668800646
 20.395491641177927
 23.226904117763667
 -7.9186331246794115
 -5.087220648093677
 -2.255808171507944
  ⋮
  9.069841734834995
 17.564079164592194
 20.395491641177927
  3.4070167816635255
  6.238429258249258
  9.069841734834995
 20.395491641177927
 26.058316594349392
 28.889729070935132

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the $R^2$ on the training set:

IAI.score(lnr, train_X, train_y, criterion=:mse)
0.6541519917396235

Or on the test set:

IAI.score(lnr, test_X, test_y, criterion=:mse)
0.6504195810342512

We can also look at the variable importance:

IAI.variable_importance(lnr)
6×2 DataFrame
 Row │ Feature              Importance
     │ Symbol               Float64
─────┼─────────────────────────────────
   1 │ froude                      1.0
   2 │ beam_draught                0.0
   3 │ length_beam                 0.0
   4 │ length_displacement         0.0
   5 │ position                    0.0
   6 │ prismatic                   0.0
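
Finally, a fitted learner can be saved to disk and reloaded later; a minimal sketch, assuming the standard IAI JSON serialization functions and an illustrative file name:

# Save the fitted GLMNet learner and reload it later
IAI.write_json("glmnet_regressor.json", lnr)
new_lnr = IAI.read_json("glmnet_regressor.json")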