Quick Start Guide: Heuristic Classifiers
This is a Python version of the corresponding Heuristics quick start guide.
In this example we will use classifiers from Heuristics on the banknote authentication dataset. First we load the data and split it into training and testing datasets:
import pandas as pd
df = pd.read_csv("data_banknote_authentication.txt", header=None,
                 names=['variance', 'skewness', 'curtosis', 'entropy', 'class'])
      variance   skewness  curtosis  entropy  class
0      3.62160    8.66610  -2.80730 -0.44699      0
1      4.54590    8.16740  -2.45860 -1.46210      0
2      3.86600   -2.63830   1.92420  0.10645      0
3      3.45660    9.52280  -4.01120 -3.59440      0
4      0.32924   -4.45520   4.57180 -0.98880      0
5      4.36840    9.67180  -3.96060 -3.16250      0
6      3.59120    3.01290   0.72888  0.56421      0
...        ...        ...       ...      ...    ...
1365  -4.50460   -5.81260  10.88670 -0.52846      1
1366  -2.41000    3.74330  -0.40215 -1.29530      1
1367   0.40614    1.34920  -1.45010 -0.55949      1
1368  -1.38870   -4.87730   6.47740  0.34179      1
1369  -3.75030  -13.45860  17.59320 -2.77710      1
1370  -3.56370   -8.38270  12.39300 -1.28230      1
1371  -2.54190   -0.65804   2.68420  1.19520      1

[1372 rows x 5 columns]
from interpretableai import iai
X = df.iloc[:, 0:4]
y = df.iloc[:, 4]
(train_X, train_y), (test_X, test_y) = iai.split_data('classification', X, y,
                                                      seed=1)
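To sanity-check the split, we can inspect the shapes of the resulting datasets (plain pandas, not part of the IAI API; split_data holds out a portion of the data for testing):
# Confirm the relative sizes of the training and test sets
train_X.shape, test_X.shape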
Random Forest Classifier
We will use a GridSearch to fit a RandomForestClassifier with some basic parameter validation:
grid = iai.GridSearch(
    iai.RandomForestClassifier(
        random_seed=1,
    ),
    max_depth=range(5, 11),
)
grid.fit(train_X, train_y)
We can make predictions on new data using predict:
grid.predict(test_X)
array([0, 0, 0, ..., 1, 1, 1])
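If predicted probabilities are needed rather than hard labels, classification learners in the IAI interface also expose predict_proba; a minimal sketch, assuming it applies to this learner:
# Predicted probability of each class for the test points
grid.predict_proba(test_X)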
We can evaluate the quality of the model using score with any of the supported loss functions. For example, the misclassification on the training set:
grid.score(train_X, train_y, criterion='misclassification')
1.0
Or the AUC on the test set:
grid.score(test_X, test_y, criterion='auc')
0.9999761376381036
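We can also inspect the ROC curve behind this AUC; a hedged sketch, assuming the iai.ROCCurve interface used elsewhere in the IAI documentation applies to this learner:
# Construct the ROC curve on the test set, treating `1` as the positive label
roc = iai.ROCCurve(grid, test_X, test_y, positive_label=1)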
We can also look at the variable importance:
grid.get_learner().variable_importance()
    Feature  Importance
0  variance    0.549464
1  skewness    0.251803
2  curtosis    0.143985
3   entropy    0.054748
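Since variable_importance returns a standard pandas DataFrame, we can also visualize it directly; a minimal sketch using matplotlib (not part of the IAI API):
import matplotlib.pyplot as plt

# Horizontal bar chart of the importance scores, most important at the top
imp = grid.get_learner().variable_importance()
imp.sort_values('Importance').plot.barh(x='Feature', y='Importance',
                                        legend=False)
plt.xlabel('Importance')
plt.tight_layout()
plt.show()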
XGBoost Classifier
We will use a GridSearch to fit an XGBoostClassifier with some basic parameter validation:
grid = iai.GridSearch(
    iai.XGBoostClassifier(
        random_seed=1,
    ),
    max_depth=range(2, 6),
    num_round=[20, 50, 100],
)
grid.fit(train_X, train_y)
We can make predictions on new data using predict:
grid.predict(test_X)
array([0, 0, 0, ..., 1, 1, 1])
We can evaluate the quality of the model using score with any of the supported loss functions. For example, the misclassification on the training set:
grid.score(train_X, train_y, criterion='misclassification')
1.0
Or the AUC on the test set:
grid.score(test_X, test_y, criterion='auc')
0.9999284129143106
We can also look at the variable importance:
grid.get_learner().variable_importance()
    Feature  Importance
0  variance    0.591870
1  skewness    0.215287
2  curtosis    0.153911
3   entropy    0.038932
We can calculate the SHAP values:
s = grid.predict_shap(test_X)
s['shap_values'][0]
array([[ 3.54498,  3.22913, -2.00503,  0.3945 ],
       [ 0.72894, -1.46553,  2.35883,  0.73552],
       [ 3.39881,  3.29186, -2.13044,  0.3003 ],
       ...,
       [-2.86904, -0.4716 , -1.27742,  0.46044],
       [-1.79681, -1.81585,  1.27945, -0.68259],
       [-3.11327, -1.48426,  1.06169, -0.7006 ]])
We can then use the SHAP library to visualize these results in whichever way we prefer.
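For example, a minimal sketch of a summary plot, assuming s['shap_values'] holds one array per class and that its rows align with the rows of test_X:
import shap

# Summarize the SHAP values for class `1` across the test set
shap.summary_plot(s['shap_values'][1], test_X)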
GLMNet Classifier
We can use a GLMNetCVClassifier to fit a GLMNet model using cross-validation:
lnr = iai.GLMNetCVClassifier(
    random_seed=1,
    n_folds=10,
)
lnr.fit(train_X, train_y)
Fitted GLMNetCVClassifier:
  Constant: 5.28597
  Weights:
    curtosis:  -3.73106
    entropy:   -0.345243
    skewness:  -2.99037
    variance:  -5.54063
  (Higher score indicates stronger prediction for class `1`)
We can access the coefficients from the fitted model with get_prediction_weights and get_prediction_constant:
numeric_weights, categoric_weights = lnr.get_prediction_weights()
numeric_weights
{'curtosis': -3.731063007781, 'skewness': -2.990372270991, 'entropy': -0.345242729262, 'variance': -5.540633797609}
categoric_weights
{}
lnr.get_prediction_constant()
5.285965343996
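Because the model is linear, these coefficients fully determine its raw score; for illustration only (not part of the IAI API), we can reconstruct the score of a single test point by hand:
# Combine the constant and the numeric weights for the first test row;
# higher scores indicate a stronger prediction for class `1`
row = test_X.iloc[0]
raw_score = lnr.get_prediction_constant() + sum(
    weight * row[feature] for feature, weight in numeric_weights.items()
)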
We can make predictions on new data using predict:
lnr.predict(test_X)
array([0, 1, 0, ..., 1, 1, 1], dtype=int32)
We can evaluate the quality of the model using score with any of the supported loss functions. For example, the misclassification on the training set:
lnr.score(train_X, train_y, criterion='misclassification')
0.986458333333
Or the AUC on the test set:
lnr.score(test_X, test_y, criterion='auc')
0.999952275276
We can also look at the variable importance:
lnr.variable_importance()
    Feature  Importance
0  skewness    0.352005
1  curtosis    0.320189
2  variance    0.313268
3   entropy    0.014537
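Finally, as a quick comparison, we can collect the test-set AUCs of the models still in scope (note that grid now holds the XGBoost search, since it was reassigned above); a small sketch using only the score calls shown in this guide:
# Side-by-side test-set AUC for the fitted models
{
    'xgboost': grid.score(test_X, test_y, criterion='auc'),
    'glmnet': lnr.score(test_X, test_y, criterion='auc'),
}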