Quick Start Guide: Optimal Feature Selection for Classification
This is a Python version of the corresponding OptimalFeatureSelection quick start guide.
In this example we will use Optimal Feature Selection on the Mushroom dataset, where the goal is to distinguish poisonous from edible mushrooms.
First we load the data and split it into training and test datasets:
import pandas as pd
df = pd.read_csv(
"agaricus-lepiota.data",
header=None,
dtype='category',
names=["target", "cap_shape", "cap_surface", "cap_color", "bruises",
"odor", "gill_attachment", "gill_spacing", "gill_size",
"gill_color", "stalk_shape", "stalk_root", "stalk_surface_above",
"stalk_surface_below", "stalk_color_above", "stalk_color_below",
"veil_type", "veil_color", "ring_number", "ring_type",
"spore_color", "population", "habitat"],
)
target cap_shape cap_surface ... spore_color population habitat
0 p x s ... k s u
1 e x s ... n n g
2 e b s ... n n m
3 p x y ... k s u
4 e x s ... n a g
5 e x y ... k n g
6 e b s ... k n m
... ... ... ... ... ... ... ...
8117 p k s ... w v d
8118 p k y ... w v d
8119 e k s ... b c l
8120 e x s ... b v l
8121 e f s ... b c l
8122 p k y ... w v l
8123 e x s ... o c l
[8124 rows x 23 columns]
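The fitted weights and importances later in this guide use names like `odor==n` and `odor_n`, which suggests each categorical column is expanded into one indicator feature per level. A minimal sketch of that kind of expansion with `pd.get_dummies`, on a tiny synthetic frame (the exact internal encoding used by the library is an assumption here):

```python
import pandas as pd

# Tiny synthetic stand-in for two of the categorical columns above.
X = pd.DataFrame({
    "odor": ["n", "f", "a"],
    "gill_size": ["b", "n", "b"],
}, dtype="category")

# Expand each categorical column into one 0/1 indicator column per level.
indicators = pd.get_dummies(X)
print(list(indicators.columns))
# ['odor_a', 'odor_f', 'odor_n', 'gill_size_b', 'gill_size_n']
```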
from interpretableai import iai
X = df.iloc[:, 1:]
y = df.iloc[:, 0]
(train_X, train_y), (test_X, test_y) = iai.split_data('classification', X, y,
seed=1)
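For intuition, a comparable stratified train/test split can be sketched with scikit-learn on synthetic data (the 70/30 proportion below is an illustrative assumption, not necessarily what `iai.split_data` uses by default):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix X and binary target y.
X = pd.DataFrame({"feat": list("abcdefghij")})
y = pd.Series(["p", "e"] * 5)

# Stratified 70/30 split so both classes keep their proportions.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)
print(len(train_X), len(test_X))  # 7 3
```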
Model Fitting
We will use a GridSearch to fit an OptimalFeatureSelectionClassifier:
grid = iai.GridSearch(
iai.OptimalFeatureSelectionClassifier(
random_seed=1,
),
sparsity=range(1, 11),
)
grid.fit(train_X, train_y, validation_criterion='auc')
All Grid Results:
Row │ sparsity train_score valid_score rank_valid_score
│ Int64 Float64 Float64 Int64
─────┼──────────────────────────────────────────────────────
1 │ 1 0.526331 0.888682 10
2 │ 2 0.720782 0.969598 9
3 │ 3 0.84893 0.982578 8
4 │ 4 0.804042 0.985475 7
5 │ 5 0.807987 0.988876 6
6 │ 6 0.90012 0.993467 3
7 │ 7 0.82893 0.989336 5
8 │ 8 0.850938 0.992436 4
9 │ 9 0.921751 0.99948 2
10 │ 10 0.923684 0.9996 1
Best Params:
sparsity => 10
Best Model - Fitted OptimalFeatureSelectionClassifier:
Constant: -0.0235654
Weights:
gill_color==b: 1.85026
gill_size==n: 1.66968
odor==a: -3.55251
odor==c: 1.96587
odor==f: 3.13526
odor==l: -3.5631
odor==n: -3.42242
odor==p: 1.97175
spore_color==r: 5.90414
stalk_root==c: 0.236795
(Higher score indicates stronger prediction for class `p`)
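The fitted model is a sparse linear scorer: the score for a mushroom is the constant plus the weights of whichever indicator features are active, with higher totals favoring class `p`. A small sketch that evaluates this by hand for a hypothetical odorless mushroom, using the coefficients printed above (the example feature values are illustrative, not a row from the dataset):

```python
# Non-zero weights copied from the fitted model printed above.
WEIGHTS = {
    ("gill_color", "b"): 1.85026,
    ("gill_size", "n"): 1.66968,
    ("odor", "a"): -3.55251,
    ("odor", "c"): 1.96587,
    ("odor", "f"): 3.13526,
    ("odor", "l"): -3.5631,
    ("odor", "n"): -3.42242,
    ("odor", "p"): 1.97175,
    ("spore_color", "r"): 5.90414,
    ("stalk_root", "c"): 0.236795,
}
CONSTANT = -0.0235654

def score(mushroom):
    """Linear score: constant plus weights of the matching feature levels."""
    return CONSTANT + sum(w for (feat, level), w in WEIGHTS.items()
                          if mushroom.get(feat) == level)

# A hypothetical odorless mushroom with broad gills: only odor==n fires,
# giving a strongly negative score, i.e. predicted edible.
print(score({"odor": "n", "gill_size": "b", "gill_color": "k"}))  # about -3.45
```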
The model selected a sparsity of 10 as the best parameter, but we observe that the validation scores are close for many of the parameters. We can use the results of the grid search to explore the tradeoff between the complexity of the regression and the quality of predictions:
results = grid.get_grid_result_summary()
ax = results.plot(x='sparsity', y='valid_score', legend=False)
ax.set_xlabel('Sparsity')
ax.set_ylabel('Validation AUC')
We see that the quality of the model increases quickly as features are added, reaching an AUC of 0.98 with 3 features. After this, additional features increase the quality more slowly, eventually reaching an AUC close to 1 with 9 features. Depending on the application, we might decide to choose a lower sparsity for the final model than the value chosen by the grid search.
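One simple rule for picking a lower sparsity is to take the smallest value whose validation score is within some tolerance of the best. A sketch using the validation scores copied from the grid results above (the 0.01 AUC tolerance is an arbitrary illustrative choice):

```python
# Validation AUC by sparsity, copied from the grid results above.
valid_score = {1: 0.888682, 2: 0.969598, 3: 0.982578, 4: 0.985475,
               5: 0.988876, 6: 0.993467, 7: 0.989336, 8: 0.992436,
               9: 0.99948, 10: 0.9996}

best = max(valid_score.values())
# Smallest sparsity within 0.01 AUC of the best validation score.
chosen = min(s for s, v in valid_score.items() if best - v <= 0.01)
print(chosen)  # 6
```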
We can see the relative importance of the selected features with variable_importance:
grid.get_learner().variable_importance()
Feature Importance
0 odor_n 0.229159
1 odor_f 0.186589
2 odor_l 0.105925
3 gill_size_n 0.104647
4 gill_color_b 0.102792
5 odor_a 0.102271
6 spore_color_r 0.072992
.. ... ...
105 stalk_surface_below_s 0.000000
106 stalk_surface_below_y 0.000000
107 veil_color_n 0.000000
108 veil_color_o 0.000000
109 veil_color_w 0.000000
110 veil_color_y 0.000000
111 veil_type_p 0.000000
[112 rows x 2 columns]
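Since most of the 112 indicator features receive zero importance, it can be convenient to keep just the selected ones. A sketch of that filter on a synthetic frame with the same two-column shape as the output above (in practice you would filter the real `variable_importance()` result):

```python
import pandas as pd

# Synthetic stand-in for the variable importance table, using a few of
# the values printed above.
imp = pd.DataFrame({
    "Feature": ["odor_n", "odor_f", "gill_size_n", "veil_type_p"],
    "Importance": [0.229159, 0.186589, 0.104647, 0.0],
})

# Keep only the features the model actually uses.
selected = imp[imp["Importance"] > 0]
print(list(selected["Feature"]))  # ['odor_n', 'odor_f', 'gill_size_n']
```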
We can make predictions on new data using predict:
grid.predict(test_X)
array(['e', 'e', 'e', ..., 'p', 'p', 'e'], dtype=object)
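The returned label array can be compared elementwise against the true labels to compute an accuracy by hand. A sketch on small synthetic arrays (in practice the inputs would be `grid.predict(test_X)` and `test_y`):

```python
import numpy as np

# Synthetic stand-ins for predicted and true labels.
preds = np.array(["e", "e", "p", "p", "e"], dtype=object)
truth = np.array(["e", "p", "p", "p", "e"], dtype=object)

# Fraction of positions where the labels match.
accuracy = (preds == truth).mean()
print(accuracy)  # 0.8
```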
We can evaluate the quality of the model using score with any of the supported loss functions. For example, the misclassification score on the training set:
grid.score(train_X, train_y, criterion='misclassification')
0.9949006506066468
Or the AUC on the test set:
grid.score(test_X, test_y, criterion='auc')
0.9997498061165998
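For reference, AUC can also be computed by hand from continuous class-`p` scores with scikit-learn's `roc_auc_score`. A sketch on synthetic labels and scores (obtaining real per-sample scores from the fitted learner is left out here, as it depends on the library's probability API):

```python
from sklearn.metrics import roc_auc_score

# Synthetic true labels and continuous scores for class 'p'.
truth = ["e", "p", "p", "e", "p"]
scores = [0.1, 0.9, 0.8, 0.3, 0.7]

# Treat 'p' as the positive class; every 'p' outscores every 'e' here,
# so the AUC is perfect.
auc = roc_auc_score([1 if t == "p" else 0 for t in truth], scores)
print(auc)  # 1.0
```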