Quick Start Guide: Heuristic Classifiers

This is a Python version of the corresponding Heuristics quick start guide.

In this example, we will use classifiers from Heuristics on the banknote authentication dataset. First, we load the data and split it into training and test datasets:

import pandas as pd

# The file has no header row, so we supply the column names ourselves
df = pd.read_csv("data_banknote_authentication.txt", header=None,
                 names=['variance', 'skewness', 'curtosis', 'entropy', 'class'])
      variance  skewness  curtosis  entropy  class
0      3.62160   8.66610  -2.80730 -0.44699      0
1      4.54590   8.16740  -2.45860 -1.46210      0
2      3.86600  -2.63830   1.92420  0.10645      0
3      3.45660   9.52280  -4.01120 -3.59440      0
4      0.32924  -4.45520   4.57180 -0.98880      0
5      4.36840   9.67180  -3.96060 -3.16250      0
6      3.59120   3.01290   0.72888  0.56421      0
...        ...       ...       ...      ...    ...
1365  -4.50460  -5.81260  10.88670 -0.52846      1
1366  -2.41000   3.74330  -0.40215 -1.29530      1
1367   0.40614   1.34920  -1.45010 -0.55949      1
1368  -1.38870  -4.87730   6.47740  0.34179      1
1369  -3.75030 -13.45860  17.59320 -2.77710      1
1370  -3.56370  -8.38270  12.39300 -1.28230      1
1371  -2.54190  -0.65804   2.68420  1.19520      1

[1372 rows x 5 columns]
from interpretableai import iai

# Separate the features from the target label
X = df.iloc[:, 0:4]
y = df.iloc[:, 4]

# Split into training and test sets, fixing the seed for reproducibility
(train_X, train_y), (test_X, test_y) = iai.split_data('classification', X, y,
                                                      seed=1)
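
By default, split_data holds out a portion of the data for testing (we believe the training proportion defaults to 0.7 and can be changed with the train_proportion keyword; treat this as an assumption and check the documentation for your version). A quick sanity check on the result:

# Check the split sizes and the class balance in the training set
# (train_proportion is assumed to default to 0.7)
print(len(train_X), len(test_X))
print(train_y.value_counts(normalize=True))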

Random Forest Classifier

We will use a GridSearch to fit a RandomForestClassifier with some basic parameter validation:

grid = iai.GridSearch(
    iai.RandomForestClassifier(
        random_seed=1,
    ),
    max_depth=range(5, 11),  # validate over tree depths 5 through 10
)
grid.fit(train_X, train_y)
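
After fitting, we can inspect which parameter combination the validation procedure selected with get_best_params:

# The max_depth value chosen by the grid search
grid.get_best_params()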

We can make predictions on new data using predict:

grid.predict(test_X)
array([0, 0, 0, ..., 1, 1, 1])
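
If we need class probabilities rather than hard labels, classification learners also support predict_proba, which returns the predicted probability of each class:

# Predicted probability of each class label for the test points
grid.predict_proba(test_X)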

We can evaluate the quality of the model using score with any of the supported loss functions; note that score follows a higher-is-better convention. For example, the misclassification on the training set:

grid.score(train_X, train_y, criterion='misclassification')
1.0

Or the AUC on the test set:

grid.score(test_X, test_y, criterion='auc')
0.9999761376381036

We can also look at the variable importance:

grid.get_learner().variable_importance()
    Feature  Importance
0  variance    0.549464
1  skewness    0.251803
2  curtosis    0.143985
3   entropy    0.054748
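
Since variable_importance returns a pandas DataFrame, we can visualize these scores with any standard plotting tool. A minimal sketch using matplotlib:

import matplotlib.pyplot as plt

# Horizontal bar chart of the importance scores computed above
imp = grid.get_learner().variable_importance()
plt.barh(imp['Feature'], imp['Importance'])
plt.xlabel('Importance')
plt.title('Random forest variable importance')
plt.show()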

XGBoost Classifier

We will use a GridSearch to fit an XGBoostClassifier with some basic parameter validation:

grid = iai.GridSearch(
    iai.XGBoostClassifier(
        random_seed=1,
    ),
    max_depth=range(2, 6),    # validate over tree depths 2 through 5
    num_round=[20, 50, 100],  # and over the number of boosting rounds
)
grid.fit(train_X, train_y)
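
Since this grid searches over two parameters, it can be useful to review the validation results for every combination. The sketch below assumes the GridSearch exposes get_grid_result_summary, as in recent versions of the package:

# Validation performance for each (max_depth, num_round) combination
grid.get_grid_result_summary()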

We can make predictions on new data using predict:

grid.predict(test_X)
array([0, 0, 0, ..., 1, 1, 1])

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the misclassification on the training set:

grid.score(train_X, train_y, criterion='misclassification')
1.0

Or the AUC on the test set:

grid.score(test_X, test_y, criterion='auc')
0.9999284129143106

We can also look at the variable importance:

grid.get_learner().variable_importance()
    Feature  Importance
0  variance    0.591870
1  skewness    0.215287
2  curtosis    0.153911
3   entropy    0.038932

We can calculate the SHAP values:

s = grid.predict_shap(test_X)
s['shap_values'][0]
array([[ 3.54498,  3.22913, -2.00503,  0.3945 ],
       [ 0.72894, -1.46553,  2.35883,  0.73552],
       [ 3.39881,  3.29186, -2.13044,  0.3003 ],
       ...,
       [-2.86904, -0.4716 , -1.27742,  0.46044],
       [-1.79681, -1.81585,  1.27945, -0.68259],
       [-3.11327, -1.48426,  1.06169, -0.7006 ]])

We can then use the SHAP library to visualize these results in whichever way we prefer.
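
For instance, a minimal sketch using shap.summary_plot on the values retrieved above (this assumes the shap package is installed; check which class label the first entry of s['shap_values'] corresponds to before interpreting the plot):

import shap

# Beeswarm summary of the SHAP values for one class (index 0, as above);
# the feature names are picked up from the test_X DataFrame
shap.summary_plot(s['shap_values'][0], test_X)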

GLMNet Classifier

We can use a GLMNetCVClassifier to fit a GLMNet model using cross-validation:

lnr = iai.GLMNetCVClassifier(
    random_seed=1,
    n_folds=10,  # select the regularization strength with 10-fold cross-validation
)
lnr.fit(train_X, train_y)
Fitted GLMNetCVClassifier:
  Constant: 5.28597
  Weights:
    curtosis: -3.73106
    entropy:  -0.345243
    skewness: -2.99037
    variance: -5.54063
  (Higher score indicates stronger prediction for class `1`)

We can access the coefficients from the fitted model with get_prediction_weights and get_prediction_constant:

numeric_weights, categoric_weights = lnr.get_prediction_weights()
numeric_weights
{'curtosis': -3.731063007781, 'skewness': -2.990372270991, 'entropy': -0.345242729262, 'variance': -5.540633797609}
categoric_weights
{}
lnr.get_prediction_constant()
5.285965343996
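
These values define a standard linear score: the constant plus the weighted sum of the features. As a sanity check, we can reconstruct the score by hand and, assuming the usual logistic link for a GLMNet classifier, map it to a probability for class 1:

import numpy as np

# Linear score: constant + sum of weight * feature value
score = lnr.get_prediction_constant() + sum(
    w * test_X[name] for name, w in numeric_weights.items()
)

# Assuming a logistic link, P(class = 1) is the sigmoid of the score
prob_1 = 1 / (1 + np.exp(-score))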

We can make predictions on new data using predict:

lnr.predict(test_X)
array([0, 1, 0, ..., 1, 1, 1], dtype=int32)

We can evaluate the quality of the model using score with any of the supported loss functions. For example, the misclassification on the training set:

lnr.score(train_X, train_y, criterion='misclassification')
0.986458333333

Or the AUC on the test set:

lnr.score(test_X, test_y, criterion='auc')
0.999952275276

We can also look at the variable importance:

lnr.variable_importance()
    Feature  Importance
0  skewness    0.352005
1  curtosis    0.320189
2  variance    0.313268
3   entropy    0.014537
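
Finally, because every model shares the same score interface, comparing them on a common metric is straightforward. A minimal sketch comparing test-set AUC for the two learners still in scope (the grid variable was reused, so it now holds the XGBoost search; the random forest would need to be refit to be included):

# Compare test-set AUC across the fitted models
for name, model in [('XGBoost', grid), ('GLMNet CV', lnr)]:
    print(name, model.score(test_X, test_y, criterion='auc'))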