Quick Start Guide: Optimal Feature Selection for Classification

In this example we will use optimal feature selection on the Mushroom dataset, where the goal is to distinguish poisonous from edible mushrooms.

First we load in the data and split into training and test datasets:

using CSV, DataFrames
df = DataFrame(CSV.File("agaricus-lepiota.data",
    pool=true,
    header=[:target, :cap_shape, :cap_surface, :cap_color, :bruises, :odor,
            :gill_attachment, :gill_spacing, :gill_size, :gill_color,
            :stalk_shape, :stalk_root, :stalk_surface_above,
            :stalk_surface_below, :stalk_color_above, :stalk_color_below,
            :veil_type, :veil_color, :ring_number, :ring_type, :spore_color,
            :population, :habitat],
))
8124×23 DataFrame. Omitted printing of 17 columns
│ Row  │ target │ cap_shape │ cap_surface │ cap_color │ bruises │ odor   │
│      │ String │ String    │ String      │ String    │ String  │ String │
├──────┼────────┼───────────┼─────────────┼───────────┼─────────┼────────┤
│ 1    │ p      │ x         │ s           │ n         │ t       │ p      │
│ 2    │ e      │ x         │ s           │ y         │ t       │ a      │
│ 3    │ e      │ b         │ s           │ w         │ t       │ l      │
│ 4    │ p      │ x         │ y           │ w         │ t       │ p      │
│ 5    │ e      │ x         │ s           │ g         │ f       │ n      │
│ 6    │ e      │ x         │ y           │ y         │ t       │ a      │
│ 7    │ e      │ b         │ s           │ w         │ t       │ a      │
⋮
│ 8117 │ p      │ k         │ y           │ n         │ f       │ s      │
│ 8118 │ p      │ k         │ s           │ e         │ f       │ y      │
│ 8119 │ p      │ k         │ y           │ n         │ f       │ f      │
│ 8120 │ e      │ k         │ s           │ n         │ f       │ n      │
│ 8121 │ e      │ x         │ s           │ n         │ f       │ n      │
│ 8122 │ e      │ f         │ s           │ n         │ f       │ n      │
│ 8123 │ p      │ k         │ y           │ n         │ f       │ y      │
│ 8124 │ e      │ x         │ s           │ n         │ f       │ n      │
X = df[:, 2:end]
y = df.target
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=1)
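
As a rough self-contained sketch of what a seeded split does (IAI.split_data additionally stratifies by class for :classification tasks, which this sketch omits; `simple_split` is an illustrative helper, not part of the IAI API):

```julia
using Random

# Shuffle the row indices with a fixed seed, then slice into train/test.
# A 70/30 split of the 8124 rows matches the 2437-row test set seen below.
function simple_split(n::Integer; train_frac=0.7, seed=1)
    idx = shuffle(MersenneTwister(seed), 1:n)
    cut = round(Int, train_frac * n)
    return idx[1:cut], idx[cut+1:end]
end

train_idx, test_idx = simple_split(8124)
```

The fixed seed makes the split reproducible across runs, which is why both the sketch and the IAI.split_data call above take a seed argument.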

Model Fitting

We will use a GridSearch to fit an OptimalFeatureSelectionClassifier:

grid = IAI.GridSearch(
    IAI.OptimalFeatureSelectionClassifier(
        random_seed=1,
    ),
    sparsity=1:10,
)
IAI.fit!(grid, train_X, train_y, validation_criterion=:auc)
All Grid Results:

│ Row │ sparsity │ train_score │ valid_score │ rank_valid_score │
│     │ Int64    │ Float64     │ Float64     │ Int64            │
├─────┼──────────┼─────────────┼─────────────┼──────────────────┤
│ 1   │ 1        │ 0.512834    │ 0.888682    │ 10               │
│ 2   │ 2        │ 0.720156    │ 0.969598    │ 9                │
│ 3   │ 3        │ 0.776552    │ 0.982218    │ 8                │
│ 4   │ 4        │ 0.803455    │ 0.985475    │ 7                │
│ 5   │ 5        │ 0.80721     │ 0.988876    │ 5                │
│ 6   │ 6        │ 0.84452     │ 0.98828     │ 6                │
│ 7   │ 7        │ 0.828293    │ 0.989336    │ 4                │
│ 8   │ 8        │ 0.849972    │ 0.992436    │ 3                │
│ 9   │ 9        │ 0.934533    │ 0.999783    │ 1                │
│ 10  │ 10       │ 0.923272    │ 0.9996      │ 2                │

Best Params:
  sparsity => 9

Best Model - Fitted OptimalFeatureSelectionClassifier:
  Constant: 0.133011
  Weights:
    gill_color==b:   1.55261
    gill_size==n:    1.86241
    odor==a:        -3.74467
    odor==f:         2.96065
    odor==l:        -3.75129
    odor==n:        -3.64143
    odor==p:         1.68068
    spore_color==r:  5.93176
    stalk_root==c:   0.167442
  (Higher score indicates stronger prediction for class `p`)
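
The fitted model is a sparse linear scorer over one-hot indicator features. As an illustrative sketch of how it scores an observation (weights copied from the printout above; the exact link and threshold IAI applies internally are not shown here, so treat `score` as a hypothetical helper):

```julia
# Weights from the fitted model printed above; keys are (feature, level).
const CONSTANT = 0.133011
const WEIGHTS = Dict(
    (:gill_color, "b")  =>  1.55261,
    (:gill_size, "n")   =>  1.86241,
    (:odor, "a")        => -3.74467,
    (:odor, "f")        =>  2.96065,
    (:odor, "l")        => -3.75129,
    (:odor, "n")        => -3.64143,
    (:odor, "p")        =>  1.68068,
    (:spore_color, "r") =>  5.93176,
    (:stalk_root, "c")  =>  0.167442,
)

# Raw score: the constant plus the weights of the indicators active in `row`.
score(row) = CONSTANT +
    sum(w for ((f, v), w) in WEIGHTS if get(row, f, nothing) == v; init=0.0)

# A row whose only active selected indicator is odor == "n" scores
# well below zero, i.e. it leans toward class "e".
score(Dict(:odor => "n"))
```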

The grid search selected a sparsity of 9 as the best parameter, but we observe that the validation scores are close for many of the candidate values. We can use the results of the grid search to explore the tradeoff between the complexity (sparsity) of the model and the quality of its predictions:

using Plots
results = IAI.get_grid_results(grid)
plot(results.sparsity, results.valid_score, xlabel="Sparsity",
     ylabel="Validation AUC", legend=nothing)

We see that the quality of the model increases quickly as features are added, reaching an AUC of 0.98 with 3 features. After this, additional features improve the quality more slowly, eventually reaching an AUC close to 1 with 9 features. Depending on the application, we might decide to choose a lower sparsity for the final model than the value chosen by the grid search.
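
One simple way to formalize that choice is to pick the smallest sparsity whose validation score is within a tolerance of the best. A self-contained sketch, using the validation scores from the grid table above (the 0.01 tolerance is an arbitrary illustration, not an IAI default):

```julia
# Validation AUC by sparsity, copied from the grid results above.
sparsity    = 1:10
valid_score = [0.888682, 0.969598, 0.982218, 0.985475, 0.988876,
               0.98828,  0.989336, 0.992436, 0.999783, 0.9996]

tol  = 0.01
best = maximum(valid_score)

# Smallest sparsity scoring within `tol` of the best (index == sparsity here).
chosen = sparsity[findfirst(s -> s >= best - tol, valid_score)]
```

With this tolerance the rule would pick sparsity 8 rather than 9, trading a small amount of validation AUC for one fewer feature.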

We can see the relative importance of the selected features with variable_importance:

IAI.variable_importance(IAI.get_learner(grid))
112×2 DataFrame
│ Row │ Feature               │ Importance │
│     │ Symbol                │ Float64    │
├─────┼───────────────────────┼────────────┤
│ 1   │ odor_n                │ 0.253547   │
│ 2   │ odor_f                │ 0.183224   │
│ 3   │ gill_size_n           │ 0.121381   │
│ 4   │ odor_l                │ 0.115967   │
│ 5   │ odor_a                │ 0.112103   │
│ 6   │ gill_color_b          │ 0.0896957  │
│ 7   │ spore_color_r         │ 0.0762575  │
⋮
│ 105 │ stalk_surface_below_k │ 0.0        │
│ 106 │ stalk_surface_below_s │ 0.0        │
│ 107 │ stalk_surface_below_y │ 0.0        │
│ 108 │ veil_color_n          │ 0.0        │
│ 109 │ veil_color_o          │ 0.0        │
│ 110 │ veil_color_w          │ 0.0        │
│ 111 │ veil_color_y          │ 0.0        │
│ 112 │ veil_type_p           │ 0.0        │

We can make predictions on new data using predict:

IAI.predict(grid, test_X)
2437-element Array{String,1}:
 "e"
 "e"
 "e"
 "e"
 "e"
 "e"
 "e"
 "p"
 "e"
 "e"
 ⋮
 "p"
 "e"
 "e"
 "e"
 "p"
 "p"
 "p"
 "p"
 "e"

We can evaluate the quality of the model using score with any of the supported loss functions, where scores are reported so that higher is better. For example, with the misclassification criterion on the training set (i.e. the proportion of points classified correctly):

IAI.score(grid, train_X, train_y, criterion=:misclassification)
0.9949006506066468
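
Equivalently, this higher-is-better misclassification score is just the accuracy of the predictions. A self-contained sketch (`accuracy` is a hypothetical helper, not part of the IAI API):

```julia
using Statistics

# Accuracy: the fraction of predictions that match the true labels.
accuracy(ŷ, y) = mean(ŷ .== y)

accuracy(["e", "p", "e"], ["e", "p", "p"])  # 2 of 3 correct
```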

Or the AUC on the test set:

IAI.score(grid, test_X, test_y, criterion=:auc)
0.9997498061165998
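
For intuition on what this number means, AUC can be computed via the rank-sum (Mann-Whitney) formulation: the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one, counting ties as half. A self-contained sketch (`auc` is an illustrative helper, not part of the IAI API):

```julia
# AUC over all positive/negative pairs: wins count 1, ties count 0.5.
function auc(scores, labels; positive="p")
    pos = scores[labels .== positive]
    neg = scores[labels .!= positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos, n in neg)
    return wins / (length(pos) * length(neg))
end

auc([0.9, 0.8, 0.3, 0.1], ["p", "e", "p", "e"])  # → 0.75
```

An AUC of 0.9997 therefore means that almost every poisonous mushroom in the test set is scored above almost every edible one.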