Quick Start Guide: Heuristic Regressors

This is a Python version of the corresponding Heuristics quick start guide.

In this example we will use regressors from Heuristics on the yacht hydrodynamics dataset, where the target is the resistance column. First, we load the data and split it into training and test sets:

import pandas as pd
df = pd.read_csv(
    "yacht_hydrodynamics.data",
    sep=r"\s+",  # whitespace-delimited file; delim_whitespace is deprecated in recent pandas
    header=None,
    names=['position', 'prismatic', 'length_displacement', 'beam_draught',
           'length_beam', 'froude', 'resistance'],
)
     position  prismatic  length_displacement  ...  length_beam  froude  resistance
0        -2.3      0.568                 4.78  ...         3.17   0.125        0.11
1        -2.3      0.568                 4.78  ...         3.17   0.150        0.27
2        -2.3      0.568                 4.78  ...         3.17   0.175        0.47
3        -2.3      0.568                 4.78  ...         3.17   0.200        0.78
4        -2.3      0.568                 4.78  ...         3.17   0.225        1.18
5        -2.3      0.568                 4.78  ...         3.17   0.250        1.82
6        -2.3      0.568                 4.78  ...         3.17   0.275        2.61
..        ...        ...                  ...  ...          ...     ...         ...
301      -2.3      0.600                 4.34  ...         2.73   0.300        4.15
302      -2.3      0.600                 4.34  ...         2.73   0.325        6.00
303      -2.3      0.600                 4.34  ...         2.73   0.350        8.47
304      -2.3      0.600                 4.34  ...         2.73   0.375       12.27
305      -2.3      0.600                 4.34  ...         2.73   0.400       19.59
306      -2.3      0.600                 4.34  ...         2.73   0.425       30.48
307      -2.3      0.600                 4.34  ...         2.73   0.450       46.66

[308 rows x 7 columns]
from interpretableai import iai
X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]
(train_X, train_y), (test_X, test_y) = iai.split_data('regression', X, y,
                                                      seed=1)
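
As a rough illustration of what a seeded train/test split does, the sketch below shuffles row indices reproducibly and cuts them into two groups. It is not the implementation behind iai.split_data: the helper name split_indices and the 70/30 proportion are assumptions made for this example, and the rows selected will differ from those chosen by the library.

```python
import random


def split_indices(n_rows, train_fraction=0.7, seed=1):
    """Shuffle row indices reproducibly and cut them into train/test lists.

    Hypothetical helper for illustration only; the 0.7 fraction is an
    assumption, not a value taken from the guide.
    """
    rng = random.Random(seed)          # a fixed seed makes the split repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(train_fraction * n_rows)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# The yacht dataset shown above has 308 rows.
train_idx, test_idx = split_indices(308)
print(len(train_idx), len(test_idx))
```

With pandas, such index lists could then be passed to df.iloc to select the corresponding rows.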

Random Forest Regressor

We will use a GridSearch to fit a RandomForestRegressor, validating over a range of values for max_depth:

grid = iai.GridSearch(
    iai.RandomForestRegressor(
        random_seed=1,
    ),
    max_depth=range(5, 11),
)
grid.fit(train_X, train_y)
All Grid Results:

 Row │ max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Float64      Float64      Int64
─────┼───────────────────────────────────────────────────────
   1 │         5     0.994449     0.990294                 6
   2 │         6     0.994511     0.990322                 1
   3 │         7     0.994515     0.990321                 2
   4 │         8     0.994515     0.990321                 3
   5 │         9     0.994515     0.990321                 4
   6 │        10     0.994515     0.990321                 5

Best Params:
  max_depth => 6

Best Model - Fitted RandomForestRegressor
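
The selection step can be reproduced by hand from the printed table. The sketch below copies the (max_depth, valid_score) pairs from the grid results and picks the row with the highest validation score. How GridSearch computes those validation scores internally is not described in this guide, so only the final selection is shown.

```python
# (max_depth, valid_score) pairs copied from the grid results printed above
results = [
    (5, 0.990294),
    (6, 0.990322),
    (7, 0.990321),
    (8, 0.990321),
    (9, 0.990321),
    (10, 0.990321),
]

# Pick the parameter value with the highest validation score
best_depth, best_score = max(results, key=lambda row: row[1])
print(best_depth)  # 6

# Spread between the best and worst validation scores in the grid
spread = best_score - min(score for _, score in results)
print(f"{spread:.6f}")
```

The spread is tiny, so on this dataset the choice of max_depth within this range makes little practical difference.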

We can make predictions on new data using predict:

grid.predict(test_X)
array([ 0.12836191,  0.24591598,  1.29305023, ..., 13.86254817,
       33.50526059, 53.29463706])

We can evaluate the quality of the model using score with any of the supported loss functions. For example, with criterion='mse' the score is reported as the $R^2$. On the training set:

grid.score(train_X, train_y, criterion='mse')
0.9954616758740213

Or on the test set:

grid.score(test_X, test_y, criterion='mse')
0.9898501909798273
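
To make the link between mean squared error and $R^2$ concrete, here is a minimal sketch of the textbook definition, $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$. It assumes the baseline is the mean of the labels being evaluated; the exact baseline used by score is not stated in this guide and may differ (for example, it could be the training-set mean). The function name r_squared and the predictions are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Textbook R^2: 1 - (squared error of model) / (squared error of mean)."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # model error
    sst = sum((t - mean_y) ** 2 for t in y_true)              # baseline error
    return 1.0 - sse / sst


# First five resistance values from the dataset preview above
y = [0.11, 0.27, 0.47, 0.78, 1.18]
# Hypothetical predictions, invented for this example
y_hat = [0.15, 0.25, 0.50, 0.75, 1.10]

print(r_squared(y, y_hat))
print(r_squared(y, y))                         # perfect predictions
print(r_squared(y, [sum(y) / len(y)] * 5))     # always predicting the mean
```

A perfect model scores 1 and a model that always predicts the mean scores 0, which is why values close to 1 in the outputs above indicate a good fit.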

We can also look at the variable importance:

grid.get_learner().variable_importance()
               Feature  Importance
0               froude    0.994405
1            prismatic    0.001657
2  length_displacement    0.001617
3         beam_draught    0.001429
4          length_beam    0.000664
5             position    0.000228
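
The printed importances appear to be normalized to sum to 1, so each value can be read as a share of the total. The sketch below checks this using the numbers copied from the table above; the normalization is inferred from those numbers rather than from any statement in this guide.

```python
# Importance values copied from the random forest output above
importance = {
    "froude": 0.994405,
    "prismatic": 0.001657,
    "length_displacement": 0.001617,
    "beam_draught": 0.001429,
    "length_beam": 0.000664,
    "position": 0.000228,
}

total = sum(importance.values())
others = total - importance["froude"]   # combined share of all other features

print(f"total = {total:.6f}")
print(f"share of features other than froude = {others:.6f}")
```

The Froude number accounts for almost all of the importance, with the five hull geometry features sharing well under 1% between them.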

XGBoost Regressor

We will use a GridSearch to fit an XGBoostRegressor, validating over max_depth and num_round:

grid = iai.GridSearch(
    iai.XGBoostRegressor(
        random_seed=1,
    ),
    max_depth=range(2, 6),
    num_round=[20, 50, 100],
)
grid.fit(train_X, train_y)
All Grid Results:

 Row │ num_round  max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Int64      Float64      Float64      Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │        20          2     0.99756      0.993216                 3
   2 │        20          3     0.999231     0.99266                  9
   3 │        20          4     0.999735     0.993189                 4
   4 │        20          5     0.999831     0.99201                 10
   5 │        50          2     0.999287     0.995097                 2
   6 │        50          3     0.999892     0.993184                 5
   7 │        50          4     0.999959     0.992946                 7
   8 │        50          5     0.999973     0.990473                11
   9 │       100          2     0.999698     0.996021                 1
  10 │       100          3     0.9999       0.993153                 6
  11 │       100          4     0.999959     0.992946                 8
  12 │       100          5     0.999973     0.990472                12

Best Params:
  num_round => 100
  max_depth => 2

Best Model - Fitted XGBoostRegressor
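
With two parameters, the grid is the Cartesian product of the candidate values, which is why the table has 4 x 3 = 12 rows. The sketch below enumerates the combinations in the same order as the printed table, pairs them with the scores copied from it, and compares the train/validation gap for the shallowest and deepest trees at 100 rounds. It only re-analyzes the printed numbers and does not call the library.

```python
from itertools import product

max_depths = range(2, 6)
num_rounds = [20, 50, 100]

# Same ordering as the printed table: num_round varies slowest
combos = list(product(num_rounds, max_depths))
print(len(combos))  # 12

# train_score and valid_score columns copied from the table, in row order
train_scores = [0.99756, 0.999231, 0.999735, 0.999831,
                0.999287, 0.999892, 0.999959, 0.999973,
                0.999698, 0.9999, 0.999959, 0.999973]
valid_scores = [0.993216, 0.99266, 0.993189, 0.99201,
                0.995097, 0.993184, 0.992946, 0.990473,
                0.996021, 0.993153, 0.992946, 0.990472]

# Best combination by validation score
best = max(zip(combos, valid_scores), key=lambda item: item[1])[0]
print(best)  # (100, 2)

# Gap between training and validation score for each combination
gaps = {c: t - v for c, t, v in zip(combos, train_scores, valid_scores)}
print(f"{gaps[(100, 2)]:.6f}", f"{gaps[(100, 5)]:.6f}")
```

The gap grows with max_depth: deeper trees fit the training data slightly better but validate worse, which is consistent with the grid search choosing a depth of 2.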

We can make predictions on new data using predict:

grid.predict(test_X)
array([ 0.23346329,  0.37805462,  1.26587391, ..., 11.98635006,
       33.26702118, 49.76500702])

We can evaluate the quality of the model using score with any of the supported loss functions. For example, with criterion='mse' the score is reported as the $R^2$. On the training set:

grid.score(train_X, train_y, criterion='mse')
0.9995068453374839

Or on the test set:

grid.score(test_X, test_y, criterion='mse')
0.9973451304965533

We can also look at the variable importance:

grid.get_learner().variable_importance()
               Feature  Importance
0               froude    0.993665
1            prismatic    0.002547
2         beam_draught    0.001372
3  length_displacement    0.001154
4             position    0.000648
5          length_beam    0.000614

We can calculate the SHAP values:

s = grid.predict_shap(test_X)
s['shap_values']
array([[-6.32807910e-02, -1.14607625e-02, -3.91643867e-02,
        -1.84658561e-02, -8.15268755e-02, -9.86756802e+00],
       [-6.64551705e-02, -1.14607625e-02, -3.78506146e-02,
        -7.68390484e-03, -8.15268755e-02, -9.73189735e+00],
       [-7.23956525e-02, -1.27419746e-02, -4.30041552e-02,
         7.60984235e-03, -1.34582460e-01, -8.79394150e+00],
       ...,
       [-2.23140568e-01, -3.45702350e-01,  7.95381218e-02,
         3.98655888e-04, -2.72205651e-01,  2.43252921e+00],
       [-1.47517592e-01, -6.43372834e-01,  5.53900421e-01,
        -3.00649643e-01, -2.73264706e-01,  2.37629852e+01],
       [-1.73150003e-01, -1.51304734e+00,  7.99156308e-01,
        -2.85290539e-01,  1.32692412e-01,  4.04897118e+01]])

We can then use the SHAP library to visualize these results in whichever way we prefer.
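
Before plotting, it can help to sanity-check the values. SHAP values are additive: for each row, the prediction equals a shared base value plus the sum of that row's SHAP values. The sketch below uses the rows and predictions printed above (the first three and last three test rows) to recover that base value. The base value itself is not shown in this guide, so the figure computed here is inferred from the printed numbers.

```python
# SHAP rows copied from the output above; columns follow the feature order
# position, prismatic, length_displacement, beam_draught, length_beam, froude
shap_rows = [
    [-6.32807910e-02, -1.14607625e-02, -3.91643867e-02,
     -1.84658561e-02, -8.15268755e-02, -9.86756802e+00],
    [-6.64551705e-02, -1.14607625e-02, -3.78506146e-02,
     -7.68390484e-03, -8.15268755e-02, -9.73189735e+00],
    [-7.23956525e-02, -1.27419746e-02, -4.30041552e-02,
     7.60984235e-03, -1.34582460e-01, -8.79394150e+00],
    [-2.23140568e-01, -3.45702350e-01, 7.95381218e-02,
     3.98655888e-04, -2.72205651e-01, 2.43252921e+00],
    [-1.47517592e-01, -6.43372834e-01, 5.53900421e-01,
     -3.00649643e-01, -2.73264706e-01, 2.37629852e+01],
    [-1.73150003e-01, -1.51304734e+00, 7.99156308e-01,
     -2.85290539e-01, 1.32692412e-01, 4.04897118e+01],
]

# Matching XGBoost predictions copied from the grid.predict(test_X) output
predictions = [0.23346329, 0.37805462, 1.26587391,
               11.98635006, 33.26702118, 49.76500702]

# prediction = base_value + sum(shap values)  =>  base_value = prediction - sum
base_values = [p - sum(row) for p, row in zip(predictions, shap_rows)]
for b in base_values:
    print(f"{b:.4f}")
```

Every row yields essentially the same base value, close to 10.31, and the last column (froude) carries almost all of each row's deviation from it.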

GLMNet Regressor

We can use a GLMNetCVRegressor to fit a GLMNet model using cross-validation:

lnr = iai.GLMNetCVRegressor(
    random_seed=1,
    nfolds=10,
)
lnr.fit(train_X, train_y)
Fitted GLMNetCVRegressor:
  Constant: -22.2638
  Weights:
    froude:  113.914

We can access the coefficients from the fitted model with get_prediction_weights and get_prediction_constant:

numeric_weights, categoric_weights = lnr.get_prediction_weights()
numeric_weights
{'froude': 113.91421370565072}
categoric_weights
{}
lnr.get_prediction_constant()
-22.26379885030819

We can make predictions on new data using predict:

lnr.predict(test_X)
array([-8.02452214, -5.17666679,  3.36689923, ..., 20.45403129,
       26.14974197, 28.99759732])

We can evaluate the quality of the model using score with any of the supported loss functions. For example, with criterion='mse' the score is reported as the $R^2$. On the training set:

lnr.score(train_X, train_y, criterion='mse')
0.6545717189938915

Or on the test set:

lnr.score(test_X, test_y, criterion='mse')
0.6510068277128165

We can also look at the variable importance:

lnr.variable_importance()
               Feature  Importance
0               froude         1.0
1         beam_draught         0.0
2          length_beam         0.0
3  length_displacement         0.0
4             position         0.0
5            prismatic         0.0