Quick Start Guide: Heuristic Classifiers

In this example, we will use classifiers from the Heuristics module on the banknote authentication dataset. First, we load the data and split it into training and test datasets:

using CSV, DataFrames
df = DataFrame(CSV.File("data_banknote_authentication.txt",
    header=[:variance, :skewness, :curtosis, :entropy, :class]))
1372×5 DataFrame
  Row │ variance  skewness   curtosis   entropy   class
      │ Float64   Float64    Float64    Float64   Int64
──────┼─────────────────────────────────────────────────
    1 │  3.6216     8.6661   -2.8073    -0.44699      0
    2 │  4.5459     8.1674   -2.4586    -1.4621       0
    3 │  3.866     -2.6383    1.9242     0.10645      0
    4 │  3.4566     9.5228   -4.0112    -3.5944       0
    5 │  0.32924   -4.4552    4.5718    -0.9888       0
    6 │  4.3684     9.6718   -3.9606    -3.1625       0
    7 │  3.5912     3.0129    0.72888    0.56421      0
    8 │  2.0922    -6.81      8.4636    -0.60216      0
  ⋮   │    ⋮          ⋮          ⋮         ⋮        ⋮
 1366 │ -4.5046    -5.8126   10.8867    -0.52846      1
 1367 │ -2.41       3.7433   -0.40215   -1.2953       1
 1368 │  0.40614    1.3492   -1.4501    -0.55949      1
 1369 │ -1.3887    -4.8773    6.4774     0.34179      1
 1370 │ -3.7503   -13.4586   17.5932    -2.7771       1
 1371 │ -3.5637    -8.3827   12.393     -1.2823       1
 1372 │ -2.5419    -0.65804   2.6842     1.1952       1
                                       1357 rows omitted
X = df[:, 1:4]
y = df[:, 5]
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=1)
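
By default, split_data holds out a portion of the data for testing. If a different split is needed, split_data accepts (to the best of our knowledge) a train_proportion keyword; the 0.75 below is purely illustrative:

# Illustrative sketch: request a 75/25 train/test split
# (train_proportion is assumed to be supported in your IAI version)
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=1, train_proportion=0.75)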

Random Forest Classifier

We will use a GridSearch to fit a RandomForestClassifier with some basic parameter validation:

grid = IAI.GridSearch(
    IAI.RandomForestClassifier(
        random_seed=1,
    ),
    max_depth=5:10,
)
IAI.fit!(grid, train_X, train_y)
All Grid Results:

 Row │ max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Float64      Float64      Int64
─────┼───────────────────────────────────────────────────────
   1 │         5     0.903239     0.871707                 6
   2 │         6     0.938004     0.902482                 5
   3 │         7     0.95689      0.918628                 4
   4 │         8     0.963382     0.923913                 3
   5 │         9     0.965263     0.926137                 1
   6 │        10     0.965293     0.925891                 2

Best Params:
  max_depth => 9

Best Model - Fitted RandomForestClassifier
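
If the winning parameters or the fitted learner itself are needed for further use, they can be extracted from the grid (the accessor names below follow the IAI API as we understand it):

# Retrieve the parameter combination with the best validation score
IAI.get_best_params(grid)        # Dict(:max_depth => 9)
# Extract the fitted learner selected by the grid search
best_lnr = IAI.get_learner(grid)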

We can make predictions on new data using predict:

IAI.predict(grid, test_X)
412-element Array{Int64,1}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1
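
If class probabilities are needed rather than hard labels, predict_proba can be used (assuming it is available for this learner in your IAI version); it returns one column of probabilities per class:

# Predicted probability of each class for every test observation
IAI.predict_proba(grid, test_X)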

We can evaluate the quality of the model using score with any of the supported loss functions (higher is better, with 1 being a perfect score). For example, the misclassification score on the training set:

IAI.score(grid, train_X, train_y, criterion=:misclassification)
1.0

Or the AUC on the test set:

IAI.score(grid, test_X, test_y, criterion=:auc)
0.9995943398477585
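
Beyond the scalar AUC, the full ROC curve can be constructed for inspection or plotting. The positive_label keyword below is an assumption about the call signature; adjust for your IAI version:

# Build the ROC curve on the test set, treating class 1 as the positive label
roc = IAI.ROCCurve(grid, test_X, test_y, positive_label=1)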

We can also look at the variable importance:

IAI.variable_importance(IAI.get_learner(grid))
4×2 DataFrame
 Row │ Feature   Importance
     │ Symbol    Float64
─────┼──────────────────────
   1 │ variance   0.554808
   2 │ skewness   0.252052
   3 │ curtosis   0.139902
   4 │ entropy    0.0532378
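
To reuse the fitted model later without refitting, the grid (or its learner) can be saved to JSON and read back, assuming write_json and read_json are available in your installation:

# Persist the fitted model to disk and reload it in a later session
IAI.write_json("rf_classifier.json", grid)
saved = IAI.read_json("rf_classifier.json")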

XGBoost Classifier

We will use a GridSearch to fit an XGBoostClassifier with some basic parameter validation:

grid = IAI.GridSearch(
    IAI.XGBoostClassifier(
        random_seed=1,
    ),
    max_depth=2:5,
    num_round=[20, 50, 100],
)
IAI.fit!(grid, train_X, train_y)
All Grid Results:

 Row │ num_round  max_depth  train_score  valid_score  rank_valid_score
     │ Int64      Int64      Float64      Float64      Int64
─────┼──────────────────────────────────────────────────────────────────
   1 │        20          2     0.902156     0.877842                12
   2 │        20          3     0.970873     0.944634                10
   3 │        20          4     0.987441     0.952704                 9
   4 │        20          5     0.989803     0.940254                11
   5 │        50          2     0.985077     0.962432                 6
   6 │        50          3     0.995074     0.976761                 3
   7 │        50          4     0.995969     0.967541                 5
   8 │        50          5     0.996379     0.959306                 8
   9 │       100          2     0.995566     0.978982                 2
  10 │       100          3     0.996601     0.979295                 1
  11 │       100          4     0.996853     0.968232                 4
  12 │       100          5     0.997185     0.961939                 7

Best Params:
  num_round => 100
  max_depth => 3

Best Model - Fitted XGBoostClassifier
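
As with the random forest, the full table of grid results can be retrieved as a DataFrame for further analysis (the accessor name below is our assumption about the IAI API):

# DataFrame of train/validation scores for every parameter combination
IAI.get_grid_result_summary(grid)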

We can make predictions on new data using predict:

IAI.predict(grid, test_X)
412-element Array{Int64,1}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1
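
XGBoost learners can also explain individual predictions via SHAP values; predict_shap (assuming it is available for XGBoostClassifier in your version) returns the per-feature contributions for each observation:

# SHAP values for the test set; the exact return structure may vary by version
shap = IAI.predict_shap(grid, test_X)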

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the misclassification on the training set:

IAI.score(grid, train_X, train_y, criterion=:misclassification)
1.0

Or the AUC on the test set:

IAI.score(grid, test_X, test_y, criterion=:auc)
0.9999522752762073

We can also look at the variable importance:

IAI.variable_importance(IAI.get_learner(grid))
4×2 DataFrame
 Row │ Feature   Importance
     │ Symbol    Float64
─────┼──────────────────────
   1 │ variance  0.616981
   2 │ skewness  0.247354
   3 │ curtosis  0.130078
   4 │ entropy   0.00558724
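
Finally, if the trained booster is needed outside of IAI, it can (as far as we know) be exported in XGBoost's native format with write_booster:

# Write the internal XGBoost booster to file for use with other XGBoost tooling
IAI.write_booster("xgb_classifier.json", IAI.get_learner(grid))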