# Quick Start Guide: Optimal Feature Selection for Classification

This is a Python version of the corresponding OptimalFeatureSelection quick start guide.

In this example we will use Optimal Feature Selection on the Mushroom dataset, where the goal is to distinguish poisonous from edible mushrooms.

First we load in the data and split it into training and test datasets:

```python
import pandas as pd
df = pd.read_csv(
    "agaricus-lepiota.data",
    dtype='category',
    names=["target", "cap_shape", "cap_surface", "cap_color", "bruises",
           "odor", "gill_attachment", "gill_spacing", "gill_size",
           "gill_color", "stalk_shape", "stalk_root", "stalk_surface_above",
           "stalk_surface_below", "stalk_color_above", "stalk_color_below",
           "veil_type", "veil_color", "ring_number", "ring_type",
           "spore_color", "population", "habitat"],
)
df
```
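The UCI Mushroom data encodes missing values as the literal string `"?"` (mainly in the stalk_root column), so it can be worth checking for these placeholders before modeling. A minimal sketch on a toy frame, so it runs without the data file present:

```python
import pandas as pd

# Toy frame standing in for the real file: the UCI Mushroom data
# encodes missing values as the string "?", mainly in stalk_root.
toy = pd.DataFrame({
    "stalk_root": pd.Categorical(["e", "?", "b", "?", "c"]),
    "odor": pd.Categorical(["n", "a", "l", "n", "f"]),
})

# Count "?" placeholders in the column known to contain them
n_missing = (toy["stalk_root"] == "?").sum()
print(n_missing)  # 2
```

On the real data, whether to drop, impute, or keep `"?"` as its own category level is a modeling choice; treating it as a category (as the `dtype='category'` load above does) is the simplest option.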

```
     target cap_shape cap_surface  ... spore_color population habitat
0         p         x           s  ...           k          s       u
1         e         x           s  ...           n          n       g
2         e         b           s  ...           n          n       m
3         p         x           y  ...           k          s       u
4         e         x           s  ...           n          a       g
5         e         x           y  ...           k          n       g
6         e         b           s  ...           k          n       m
...     ...       ...         ...  ...         ...        ...     ...
8117      p         k           s  ...           w          v       d
8118      p         k           y  ...           w          v       d
8119      e         k           s  ...           b          c       l
8120      e         x           s  ...           b          v       l
8121      e         f           s  ...           b          c       l
8122      p         k           y  ...           w          v       l
8123      e         x           s  ...           o          c       l

[8124 rows x 23 columns]
```
```python
from interpretableai import iai
X = df.iloc[:, 1:]  # all 22 feature columns
y = df.iloc[:, 0]
(train_X, train_y), (test_X, test_y) = iai.split_data('classification', X, y,
                                                      seed=1)
```
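For a classification task, iai.split_data keeps the class mix similar between the training and test sets (a stratified split). A rough pure-pandas sketch of the idea; the internals of the real function and the 75/25 proportion here are illustrative assumptions, not its actual implementation:

```python
import pandas as pd

def stratified_split(X, y, train_frac=0.75, seed=1):
    """Sample train_frac of each class for training; the rest is test."""
    train_idx = []
    for label in y.unique():
        # Sample the same fraction within each class label
        idx = y[y == label].sample(frac=train_frac, random_state=seed).index
        train_idx.extend(idx)
    train_mask = y.index.isin(train_idx)
    return (X[train_mask], y[train_mask]), (X[~train_mask], y[~train_mask])

# Tiny demo: 4 poisonous and 4 edible rows
y = pd.Series(list("ppppeeee"))
X = pd.DataFrame({"odor": list("ffffnnnn")})
(tr_X, tr_y), (te_X, te_y) = stratified_split(X, y)
print(tr_y.value_counts().to_dict())  # 3 of each class in training
```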


### Model Fitting

We will use a GridSearch to fit an OptimalFeatureSelectionClassifier:

```python
grid = iai.GridSearch(
    iai.OptimalFeatureSelectionClassifier(
        random_seed=1,
    ),
    sparsity=range(1, 11),
)
grid.fit(train_X, train_y, validation_criterion='auc')
```

```
All Grid Results:

│ Row │ sparsity │ train_score │ valid_score │ rank_valid_score │
│     │ Int64    │ Float64     │ Float64     │ Int64            │
├─────┼──────────┼─────────────┼─────────────┼──────────────────┤
│ 1   │ 1        │ 0.507053    │ 0.892471    │ 10               │
│ 2   │ 2        │ 0.616231    │ 0.945726    │ 9                │
│ 3   │ 3        │ 0.711853    │ 0.969207    │ 8                │
│ 4   │ 4        │ 0.7645      │ 0.977629    │ 7                │
│ 5   │ 5        │ 0.802974    │ 0.981979    │ 6                │
│ 6   │ 6        │ 0.824427    │ 0.984656    │ 5                │
│ 7   │ 7        │ 0.877073    │ 0.997937    │ 3                │
│ 8   │ 8        │ 0.916794    │ 0.999686    │ 1                │
│ 9   │ 9        │ 0.874916    │ 0.995357    │ 4                │
│ 10  │ 10       │ 0.907388    │ 0.99862     │ 2                │

Best Params:
  sparsity => 8

Best Model - Fitted OptimalFeatureSelectionClassifier:
  Constant: 1.35687
  Weights:
    gill_color==b:   1.24676
    gill_size==b:   -1.20485
    gill_size==n:    1.20485
    odor==a:        -3.81186
    odor==f:         2.98145
    odor==l:        -3.81858
    odor==n:        -3.7076
    spore_color==r:  5.98818
```
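The fitted model is linear in the selected indicator features, so its output can be recomputed by hand from the printed constant and weights. The sketch below assumes a logistic link for turning the linear score into a probability, and the choice of positive class is also an assumption; consult the OptimalFeatureSelection documentation for the exact conventions:

```python
import math

# Constant and weights copied from the fitted model above
constant = 1.35687
weights = {
    "gill_color==b": 1.24676,
    "gill_size==b": -1.20485,
    "gill_size==n": 1.20485,
    "odor==a": -3.81186,
    "odor==f": 2.98145,
    "odor==l": -3.81858,
    "odor==n": -3.7076,
    "spore_color==r": 5.98818,
}

def linear_score(features):
    """features: dict like {'gill_size': 'n', 'odor': 'n', ...}"""
    z = constant
    for key, w in weights.items():
        name, level = key.split("==")
        if features.get(name) == level:  # indicator is 1 when levels match
            z += w
    return z

def prob_positive(features):
    # Logistic link (assumed, not confirmed by the printout above)
    return 1 / (1 + math.exp(-linear_score(features)))

# A hypothetical odorless mushroom with narrow gills
mushroom = {"odor": "n", "gill_size": "n", "gill_color": "k", "spore_color": "n"}
print(round(prob_positive(mushroom), 3))
```

Note how the large negative weights on odor==a, odor==l, and odor==n pull the score down, matching the folk wisdom that pleasant- or no-odor mushrooms are more likely edible.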

The grid search selected a sparsity of 8 as the best parameter, but we observe that the validation scores are close for many of the parameter values. We can use the results of the grid search to explore the tradeoff between the complexity of the model and the quality of its predictions:

```python
results = grid.get_grid_results()
ax = results.plot(x='sparsity', y='valid_score', legend=False)
ax.set_xlabel('Sparsity')
ax.set_ylabel('Validation AUC')
```


We see that the quality of the model increases quickly as terms are added, reaching an AUC of roughly 0.98 with four terms. Beyond this point, additional terms improve the quality more slowly, with the AUC approaching 1 at seven terms. Depending on the application, we might choose a lower sparsity for the final model than the value selected by the grid search.
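One common rule for picking a simpler model is to take the smallest sparsity whose validation score is within some tolerance of the best. Using the validation AUCs from the grid table above (the tolerance values here are arbitrary choices for illustration):

```python
# Validation AUCs copied from the grid results above
valid_auc = {1: 0.892471, 2: 0.945726, 3: 0.969207, 4: 0.977629,
             5: 0.981979, 6: 0.984656, 7: 0.997937, 8: 0.999686,
             9: 0.995357, 10: 0.99862}

def smallest_within_tol(scores, tol):
    """Smallest sparsity whose score is within tol of the best score."""
    best = max(scores.values())
    return min(s for s, v in scores.items() if v >= best - tol)

print(smallest_within_tol(valid_auc, 0.005))  # 7
print(smallest_within_tol(valid_auc, 0.02))   # 5
```

A tolerance of 0.005 keeps us at seven terms, while accepting a 0.02 drop in AUC lets us use a five-term model.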

We can make predictions on new data using predict:

```python
grid.predict(test_X)
```

```
array(['e', 'e', 'e', ..., 'e', 'p', 'e'], dtype=object)
```

We can evaluate the quality of the model using score with any of the supported loss functions, where higher scores are better. For example, the score under the misclassification criterion on the training set (here, the proportion classified correctly):

```python
grid.score(train_X, train_y, criterion='misclassification')
```

```
0.9934939335326183
```

Or the AUC on the test set:

```python
grid.score(test_X, test_y, criterion='auc')
```

```
0.9997997100178709
```
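For intuition, AUC can be computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting half. A small self-contained sketch, independent of the fitted model; treating 'p' (poisonous) as the positive class here is an arbitrary choice for the demo:

```python
def auc(scores, labels, positive):
    """Rank-based AUC: fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 positives, 2 negatives
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = ["p", "p", "e", "p", "e"]
print(auc(scores, labels, "p"))  # 5 of 6 pairs ranked correctly -> ~0.833
```

The near-1 test AUC above therefore means the model's scores rank almost every poisonous/edible pair correctly.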