Quick Start Guide: Heuristic Classifiers

In this example we will use classifiers from Heuristics on the banknote authentication dataset. First we load in the data and split into training and test datasets:

using CSV, DataFrames
df = DataFrame(CSV.File("data_banknote_authentication.txt",
    header=[:variance, :skewness, :curtosis, :entropy, :class]))
1372×5 DataFrame
  Row │ variance  skewness   curtosis   entropy   class
      │ Float64   Float64    Float64    Float64   Int64
──────┼─────────────────────────────────────────────────
    1 │  3.6216     8.6661   -2.8073    -0.44699      0
    2 │  4.5459     8.1674   -2.4586    -1.4621       0
    3 │  3.866     -2.6383    1.9242     0.10645      0
    4 │  3.4566     9.5228   -4.0112    -3.5944       0
    5 │  0.32924   -4.4552    4.5718    -0.9888       0
    6 │  4.3684     9.6718   -3.9606    -3.1625       0
    7 │  3.5912     3.0129    0.72888    0.56421      0
    8 │  2.0922    -6.81      8.4636    -0.60216      0
  ⋮   │    ⋮          ⋮          ⋮         ⋮        ⋮
 1366 │ -4.5046    -5.8126   10.8867    -0.52846      1
 1367 │ -2.41       3.7433   -0.40215   -1.2953       1
 1368 │  0.40614    1.3492   -1.4501    -0.55949      1
 1369 │ -1.3887    -4.8773    6.4774     0.34179      1
 1370 │ -3.7503   -13.4586   17.5932    -2.7771       1
 1371 │ -3.5637    -8.3827   12.393     -1.2823       1
 1372 │ -2.5419    -0.65804   2.6842     1.1952       1
                                       1357 rows omitted
X = df[:, 1:4]
y = df[:, 5]
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=1)

Random Forest Classifier

We will use a GridSearch to fit a RandomForestClassifier with some basic parameter validation:

grid = IAI.GridSearch(
    IAI.RandomForestClassifier(
        random_seed=1,
    ),
    max_depth=5:10,
)
IAI.fit!(grid, train_X, train_y)
All Grid Results:

 Row │ max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Float64      Float64      Int64
─────┼───────────────────────────────────────────────────────
   1 │         5     0.896915     0.866758                 6
   2 │         6     0.938223     0.899417                 5
   3 │         7     0.956856     0.912661                 4
   4 │         8     0.963755     0.920417                 3
   5 │         9     0.965845     0.920693                 2
   6 │        10     0.966141     0.921069                 1

Best Params:
  max_depth => 10

Best Model - Fitted RandomForestClassifier

We can make predictions on new data using predict:

IAI.predict(grid, test_X)
412-element Vector{Int64}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the misclassification on the training set:

IAI.score(grid, train_X, train_y, criterion=:misclassification)
1.0

Or the AUC on the test set:

IAI.score(grid, test_X, test_y, criterion=:auc)
0.9999761376381036

We can also look at the variable importance:

IAI.variable_importance(IAI.get_learner(grid))
4×2 DataFrame
 Row │ Feature   Importance
     │ Symbol    Float64
─────┼──────────────────────
   1 │ variance   0.549464
   2 │ skewness   0.251803
   3 │ curtosis   0.143985
   4 │ entropy    0.0547478

XGBoost Classifier

We will use a GridSearch to fit an XGBoostClassifier with some basic parameter validation:

grid = IAI.GridSearch(
    IAI.XGBoostClassifier(
        random_seed=1,
    ),
    max_depth=2:5,
    num_round=[20, 50, 100],
)
IAI.fit!(grid, train_X, train_y)
All Grid Results:

 Row │ num_round  max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Int64      Float64      Float64      Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │        20          2     0.909192     0.886487                12
   2 │        20          3     0.973674     0.92655                 11
   3 │        20          4     0.987791     0.941853                 9
   4 │        20          5     0.989819     0.936374                10
   5 │        50          2     0.984743     0.966999                 2
   6 │        50          3     0.995245     0.954592                 5
   7 │        50          4     0.996033     0.952682                 6
   8 │        50          5     0.996368     0.949697                 8
   9 │       100          2     0.995576     0.982432                 1
  10 │       100          3     0.996672     0.958094                 3
  11 │       100          4     0.99691      0.957024                 4
  12 │       100          5     0.997079     0.952448                 7

Best Params:
  num_round => 100
  max_depth => 2

Best Model - Fitted XGBoostClassifier

We can make predictions on new data using predict:

IAI.predict(grid, test_X)
412-element Vector{Int64}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the misclassification on the training set:

IAI.score(grid, train_X, train_y, criterion=:misclassification)
1.0

Or the AUC on the test set:

IAI.score(grid, test_X, test_y, criterion=:auc)
0.9999284129143106

We can also look at the variable importance:

IAI.variable_importance(IAI.get_learner(grid))
4×2 DataFrame
 Row │ Feature   Importance
     │ Symbol    Float64
─────┼──────────────────────
   1 │ variance   0.59187
   2 │ skewness   0.215287
   3 │ curtosis   0.153911
   4 │ entropy    0.0389316

We can calculate the SHAP values:

s = IAI.predict_shap(grid, test_X)
s[:shap_values][1]
412×4 Matrix{Float64}:
  3.54498    3.22913   -2.00503    0.3945
  0.728945  -1.46553    2.35883    0.735524
  3.39881    3.29186   -2.13044    0.300303
  2.95136   -2.37941    2.52937    0.648339
  1.1021     3.35015   -1.98985    0.507956
  4.3507    -1.7485     0.930799  -0.612113
  4.09464   -1.62499    1.86802   -0.612113
  4.34745   -0.151717  -0.125396  -0.313408
  3.2987     3.23863   -2.67317    0.300303
 -0.15059   -1.10272    2.54168   -0.638055
  ⋮
  1.17481   -0.609447  -4.14658    0.41376
 -2.12014   -2.18583    0.292272  -0.725072
 -0.380289  -1.71014   -1.14518   -0.673723
  1.35037   -0.504165  -3.06564   -0.485776
  1.4018    -0.716085  -3.27848    0.507956
 -0.480185  -0.789143  -3.00252    0.41376
 -2.86904   -0.471602  -1.27742    0.460438
 -1.79681   -1.81585    1.27945   -0.682587
 -3.11327   -1.48426    1.06169   -0.700595

We can then use the SHAP library in Python to visualize these results in whichever way we prefer. For example, the following code creates a summary plot:

using PyCall
shap = pyimport("shap")
shap.summary_plot(s[:shap_values][1], Matrix(s[:features]), names(s[:features]))

GLMNet Classifier

We can use a GLMNetCVClassifier to fit a GLMNet model using cross-validation:

lnr = IAI.GLMNetCVClassifier(
    random_seed=1,
    n_folds=10,
)
IAI.fit!(lnr, train_X, train_y)
Fitted GLMNetCVClassifier:
  Constant: 5.28597
  Weights:
    curtosis: -3.73106
    entropy:  -0.345243
    skewness: -2.99037
    variance: -5.54063
  (Higher score indicates stronger prediction for class `1`)

We can access the coefficients from the fitted model with get_prediction_weights and get_prediction_constant:

numeric_weights, categoric_weights = IAI.get_prediction_weights(lnr)
numeric_weights
Dict{Symbol, Float64} with 4 entries:
  :curtosis => -3.73106
  :skewness => -2.99037
  :entropy  => -0.345243
  :variance => -5.54063
categoric_weights
Dict{Symbol, Dict{Any, Float64}}()
IAI.get_prediction_constant(lnr)
5.28596534

We can make predictions on new data using predict:

IAI.predict(lnr, test_X)
412-element Vector{Int64}:
 0
 1
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the misclassification on the training set:

IAI.score(lnr, train_X, train_y, criterion=:misclassification)
0.9864583333333333

Or the AUC on the test set:

IAI.score(lnr, test_X, test_y, criterion=:auc)
0.9999522752762073

We can also look at the variable importance:

IAI.variable_importance(lnr)
4×2 DataFrame
 Row │ Feature   Importance
     │ Symbol    Float64
─────┼──────────────────────
   1 │ skewness   0.352005
   2 │ curtosis   0.320189
   3 │ variance   0.313268
   4 │ entropy    0.0145371