Quick Start Guide: Heuristic Regressors

This is a Python version of the corresponding Heuristics quick start guide.

In this example, we use regressors from Heuristics on the yacht hydrodynamics dataset. First, we load the data and split it into training and test datasets:

import pandas as pd
df = pd.read_csv(
    "yacht_hydrodynamics.data",
    sep=r"\s+",  # whitespace-delimited; delim_whitespace= is deprecated in pandas 2.2+
    header=None,
    names=['position', 'prismatic', 'length_displacement', 'beam_draught',
           'length_beam', 'froude', 'resistance'],
)
     position  prismatic  length_displacement  ...  length_beam  froude  resistance
0        -2.3      0.568                 4.78  ...         3.17   0.125        0.11
1        -2.3      0.568                 4.78  ...         3.17   0.150        0.27
2        -2.3      0.568                 4.78  ...         3.17   0.175        0.47
3        -2.3      0.568                 4.78  ...         3.17   0.200        0.78
4        -2.3      0.568                 4.78  ...         3.17   0.225        1.18
5        -2.3      0.568                 4.78  ...         3.17   0.250        1.82
6        -2.3      0.568                 4.78  ...         3.17   0.275        2.61
..        ...        ...                  ...  ...          ...     ...         ...
301      -2.3      0.600                 4.34  ...         2.73   0.300        4.15
302      -2.3      0.600                 4.34  ...         2.73   0.325        6.00
303      -2.3      0.600                 4.34  ...         2.73   0.350        8.47
304      -2.3      0.600                 4.34  ...         2.73   0.375       12.27
305      -2.3      0.600                 4.34  ...         2.73   0.400       19.59
306      -2.3      0.600                 4.34  ...         2.73   0.425       30.48
307      -2.3      0.600                 4.34  ...         2.73   0.450       46.66

[308 rows x 7 columns]
from interpretableai import iai
X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]
(train_X, train_y), (test_X, test_y) = iai.split_data('regression', X, y,
                                                      seed=1)
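Conceptually, split_data holds out a random subset of rows for testing (the 92 test predictions later in this guide are consistent with a 70/30 split of the 308 rows). A minimal sketch of such a split with NumPy; the 70/30 proportion here is an assumption for illustration, not iai's documented default:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 308                               # rows in the yacht dataset
perm = rng.permutation(n)             # shuffle the row indices
n_train = int(round(0.7 * n))         # assumed 70/30 train/test proportion
train_idx, test_idx = perm[:n_train], perm[n_train:]
print(len(train_idx), len(test_idx))  # 216 92
```

In practice you would index X and y with these arrays (e.g. X.iloc[train_idx]) rather than splitting by hand; iai.split_data does this for you and also supports stratified splits for classification tasks.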

Random Forest Regressor

We will use a GridSearch to fit a RandomForestRegressor with some basic parameter validation:

grid = iai.GridSearch(
    iai.RandomForestRegressor(
        random_seed=1,
    ),
    max_depth=range(5, 11),
)
grid.fit(train_X, train_y)
All Grid Results:

 Row │ max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Float64      Float64      Int64
─────┼───────────────────────────────────────────────────────
   1 │         5     0.998792     0.995212                 6
   2 │         6     0.999109     0.995511                 5
   3 │         7     0.999189     0.995521                 2
   4 │         8     0.999205     0.995522                 1
   5 │         9     0.999207     0.995519                 3
   6 │        10     0.999207     0.995519                 4

Best Params:
  max_depth => 8

Best Model - Fitted RandomForestRegressor

We can make predictions on new data using predict:

grid.predict(test_X)
array([ 0.09691999,  0.28367277,  1.27294232,  2.77383666,  5.083565  ,
       12.91213286, 21.03160167,  0.09782665,  0.28199285,  0.49682971,
        1.77856218,  3.39283254,  4.93092667, 12.81255286, 51.64830833,
        0.49504705,  0.79381551,  5.07728643, 21.208385  , 33.37613595,
        0.09628776,  0.51801915,  1.30719161,  2.77196261,  5.07719833,
        7.69981254,  1.8525929 ,  4.99798   , 33.565825  ,  1.70427105,
        2.50171214,  4.74597167, 12.75123619, 33.39946667,  5.06899833,
        3.97601311,  0.25943276,  0.49114647,  1.28675125, 12.90923952,
       33.140915  , 49.32153667,  0.25665538,  0.76506468,  2.80367111,
        5.33649476, 34.41753095,  7.75091869, 12.66252405,  1.33414134,
        5.32317333, 14.43079667,  0.2790348 ,  0.51743243,  1.23306934,
        1.90676091,  2.78081567, 34.09669762,  0.51744731, 13.636935  ,
        0.09717984,  0.78743438, 38.155     ,  2.31056   , 23.85184   ,
        0.08155502,  0.4880515 ,  1.3018634 ,  8.87674167,  0.48019354,
       19.82010833, 47.48345667,  0.09717984,  0.7835216 ,  3.58963451,
        4.84401333, 14.70656667, 38.2398    , 55.0036    ,  1.21358018,
        1.79415322, 53.8275    ,  0.2510457 ,  2.87445722,  7.89207747,
       12.82145286,  1.27231554,  1.93486377,  2.99519698, 12.93131952,
       33.007085  , 50.49560667])

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the $R^2$ on the training set:

grid.score(train_X, train_y, criterion='mse')
0.9993065912313013

Or on the test set:

grid.score(test_X, test_y, criterion='mse')
0.9937779440350902
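As noted above, the value reported with criterion='mse' is the $R^2$ statistic, i.e. the fraction of variance explained. A self-contained sketch of that computation with NumPy (synthetic numbers, not the model output above):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SSE/SST: fraction of variance explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    sst = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - sse / sst

# Perfect predictions give R^2 = 1
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
```

A score near 1 (as here) means the residual error is tiny relative to the spread of the target; a score near 0 means the model does no better than predicting the mean.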

We can also look at the variable importance:

grid.get_learner().variable_importance()
               Feature  Importance
0               froude    0.990682
1            prismatic    0.004045
2         beam_draught    0.002431
3             position    0.001415
4  length_displacement    0.001226
5          length_beam    0.000201
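The printed importance scores sum to 1, with the Froude number dominating at roughly 0.99. A quick check, rebuilding the table above in pandas:

```python
import pandas as pd

# Importance scores as printed above for the fitted random forest
imp = pd.DataFrame({
    "Feature": ["froude", "prismatic", "beam_draught", "position",
                "length_displacement", "length_beam"],
    "Importance": [0.990682, 0.004045, 0.002431, 0.001415, 0.001226, 0.000201],
})
print(round(imp["Importance"].sum(), 6))  # 1.0 — scores are normalized
```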

XGBoost Regressor

We will use a GridSearch to fit an XGBoostRegressor with some basic parameter validation:

grid = iai.GridSearch(
    iai.XGBoostRegressor(
        random_seed=1,
    ),
    max_depth=range(2, 6),
    num_round=[20, 50, 100],
)
grid.fit(train_X, train_y)
All Grid Results:

 Row │ num_round  max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Int64      Float64      Float64      Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │        20          2     0.997817     0.995225                 6
   2 │        20          3     0.999371     0.9953                   5
   3 │        20          4     0.999748     0.992537                 7
   4 │        20          5     0.999816     0.992295                 8
   5 │        50          2     0.999118     0.99551                  4
   6 │        50          3     0.999904     0.995645                 3
   7 │        50          4     0.999953     0.991627                 9
   8 │        50          5     0.999966     0.990115                11
   9 │       100          2     0.999632     0.9962                   1
  10 │       100          3     0.999904     0.995646                 2
  11 │       100          4     0.999953     0.991627                10
  12 │       100          5     0.999966     0.990115                12

Best Params:
  num_round => 100
  max_depth => 2

Best Model - Fitted XGBoostRegressor
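The grid above evaluates every combination of num_round and max_depth, i.e. 3 × 4 = 12 candidates, which matches the 12 rows of results. A sketch of that enumeration with itertools:

```python
from itertools import product

num_round = [20, 50, 100]
max_depth = range(2, 6)
# GridSearch tries the full Cartesian product of the parameter values
candidates = list(product(num_round, max_depth))
print(len(candidates))  # 12 parameter combinations
```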

We can make predictions on new data using predict:

grid.predict(test_X)
array([ 0.23346329,  0.37805462,  1.26587391,  2.8110404 ,  5.4151907 ,
       12.59899044, 20.52502251,  0.21096897,  0.25713634,  0.42427921,
        1.6431427 ,  3.31036711,  4.93886709, 12.45051765, 51.10633087,
        0.5053463 ,  0.81924629,  5.67969608, 21.48241425, 33.42628479,
        0.23699856,  0.54873276,  1.26743984,  2.81260777,  5.43328571,
        7.79358101,  2.13622856,  5.82654142, 33.68614578,  1.78495407,
        2.50792265,  5.20168018, 12.50411606, 34.9638176 ,  5.34564257,
        3.52108955,  0.11473274,  0.34785557,  1.160779  , 12.91129589,
       33.26576614, 51.12511063,  0.257864  ,  0.87720394,  2.95927191,
        6.89947891, 36.78110886,  7.98637676, 12.86630917,  1.25722218,
        6.60506535, 14.72911072,  0.23804474,  0.40518761,  1.16177273,
        1.79500675,  2.70694017, 32.55923843,  0.34300327, 13.72436047,
        0.10033131,  0.66732693, 38.25563049,  1.96538162, 26.26293755,
        0.2122879 ,  0.69859791,  1.63727951,  9.43823242,  0.5461874 ,
       20.60538101, 48.14624023,  0.09963322,  0.85429096,  4.17500591,
        6.6595397 , 15.96043587, 39.61088943, 59.35378647,  1.19994926,
        1.77716351, 56.34794617,  0.26231194,  2.81908894,  7.73881817,
       12.53344917,  1.457201  ,  2.09043598,  3.0023675 , 11.98635006,
       33.26702118, 49.76500702])

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the $R^2$ on the training set:

grid.score(train_X, train_y, criterion='mse')
0.9995068453374839

Or on the test set:

grid.score(test_X, test_y, criterion='mse')
0.9973451304965533

We can also look at the variable importance:

grid.get_learner().variable_importance()
               Feature  Importance
0               froude    0.993665
1            prismatic    0.002547
2         beam_draught    0.001372
3  length_displacement    0.001154
4             position    0.000648
5          length_beam    0.000614

GLMNet Regressor

We can use a GLMNetCVRegressor to fit a GLMNet model using cross-validation:

lnr = iai.GLMNetCVRegressor(
    random_seed=1,
    nfolds=10,
)
lnr.fit(train_X, train_y)
Fitted GLMNetCVRegressor:
  Constant: -22.0757
  Weights:
    froude:  113.256

We can access the coefficients from the fitted model with get_prediction_weights and get_prediction_constant:

numeric_weights, categoric_weights = lnr.get_prediction_weights()
numeric_weights
{'froude': 113.25649906342936}
categoric_weights
{}
lnr.get_prediction_constant()
-22.07569550760808
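Since froude carries the only non-zero weight, every prediction is simply constant + weight × froude. Reproducing one prediction by hand from the coefficients above (the froude value of 0.125 is chosen for illustration; it corresponds to the first row of the dataset):

```python
# Coefficients reported by the fitted GLMNetCVRegressor above
constant = -22.07569550760808
weight_froude = 113.25649906342936

# Linear model: prediction = constant + weight * froude
froude = 0.125
pred = constant + weight_froude * froude
print(round(pred, 8))  # -7.91863312
```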

We can make predictions on new data using predict:

lnr.predict(test_X)
array([-7.91863312, -5.08722065,  3.40701678,  9.06984173, 14.73266669,
       20.39549164, 23.22690412, -7.91863312, -5.08722065, -2.25580817,
        6.23842926, 11.90125421, 14.73266669, 20.39549164, 28.88972907,
       -2.25580817,  0.57560431, 14.73266669, 23.22690412, 26.05831659,
       -7.91863312, -2.25580817,  3.40701678,  9.06984173, 14.73266669,
       17.56407916,  6.23842926, 14.73266669, 26.05831659,  6.23842926,
        9.06984173, 14.73266669, 20.39549164, 26.05831659, 14.73266669,
       11.90125421, -5.08722065, -2.25580817,  3.40701678, 20.39549164,
       26.05831659, 28.88972907, -5.08722065,  0.57560431,  9.06984173,
       14.73266669, 26.05831659, 17.56407916, 20.39549164,  3.40701678,
       14.73266669, 20.39549164, -5.08722065, -2.25580817,  3.40701678,
        6.23842926,  9.06984173, 26.05831659, -2.25580817, 20.39549164,
       -7.91863312,  0.57560431, 26.05831659,  9.06984173, 23.22690412,
       -7.91863312, -2.25580817,  3.40701678, 17.56407916, -2.25580817,
       23.22690412, 28.88972907, -7.91863312,  0.57560431, 11.90125421,
       14.73266669, 20.39549164, 26.05831659, 28.88972907,  3.40701678,
        6.23842926, 28.88972907, -5.08722065,  9.06984173, 17.56407916,
       20.39549164,  3.40701678,  6.23842926,  9.06984173, 20.39549164,
       26.05831659, 28.88972907])

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the $R^2$ on the training set:

lnr.score(train_X, train_y, criterion='mse')
0.6541519917396235

Or on the test set:

lnr.score(test_X, test_y, criterion='mse')
0.6504195810342512

We can also look at the variable importance:

lnr.variable_importance()
               Feature  Importance
0               froude         1.0
1         beam_draught         0.0
2          length_beam         0.0
3  length_displacement         0.0
4             position         0.0
5            prismatic         0.0
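Finally, comparing the three models on the held-out test set (test-set scores copied from above), the tree ensembles clearly outperform the purely linear GLMNet fit on this dataset:

```python
# Test-set R^2 values reported earlier in this guide
test_scores = {
    "RandomForestRegressor": 0.9937779440350902,
    "XGBoostRegressor": 0.9973451304965533,
    "GLMNetCVRegressor": 0.6504195810342512,
}
best = max(test_scores, key=test_scores.get)
print(best)  # XGBoostRegressor
```

This gap is expected: resistance grows sharply and nonlinearly with the Froude number, which a single linear term cannot capture.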