Parameter Tuning
When trying to select the right imputation model for a dataset, there are two considerations: which method to use for imputation and how to tune its parameters.
We recommend using GridSearch to select both the method and its parameters. The easiest way to do this is to validate over multiple parameter grids, one for each method under consideration, each containing the parameters to tune for that method. The following example constructs a grid search with two parameter grids to select between the opt_knn and opt_tree optimal imputation methods, tuning knn_k for opt_knn. A grid for a method that uses svm_gamma could be added to the list in the same way:
grid = IAI.GridSearch(
    IAI.ImputationLearner(
        random_seed=1,
    ),
    [
        (method=:opt_knn, knn_k=[5, 10, 15]),
        (method=:opt_tree,),
    ],
)
GridSearch - Unfitted OptKNNImputationLearner:
  random_seed: 1

GridSearch Params:
  (method=opt_knn,knn_k=5,)
  (method=opt_knn,knn_k=10,)
  (method=opt_knn,knn_k=15,)
  (method=opt_tree,)
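To see where the entries under GridSearch Params come from, here is a minimal sketch in plain Julia (no IAI required) of how a list of parameter grids can be expanded into individual combinations. The helper expand_grid is hypothetical and written only for illustration; it is not part of IAI and does not claim to match how GridSearch works internally.

```julia
# Hypothetical helper (not an IAI function): expand one parameter grid, a
# NamedTuple whose values are scalars or vectors, into every combination.
function expand_grid(grid::NamedTuple)
    ks = keys(grid)
    # Wrap scalars so that every entry is a list of candidate values
    candidates = [v isa AbstractVector ? v : [v] for v in values(grid)]
    combos = [NamedTuple{ks}(c) for c in Iterators.product(candidates...)]
    return vec(combos)
end

# The same grids as in the example above
grids = [
    (method=:opt_knn, knn_k=[5, 10, 15]),
    (method=:opt_tree,),
]

# Expand each grid separately, then concatenate the results
options = reduce(vcat, expand_grid.(grids))

foreach(println, options)
println(length(options))  # 4
```

Each grid is expanded on its own, so knn_k is only ever paired with opt_knn and never with opt_tree.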
We can see that the parameter grids have been expanded into four options for the grid search to select between. We can now run the grid search to find the best method and parameter combination:
IAI.fit!(grid, df)
All Grid Results:

 Row │ method    knn_k    train_score  valid_score  rank_valid_score
     │ Symbol    Int64?   Float64      Float64      Int64
─────┼───────────────────────────────────────────────────────────────
   1 │ opt_knn         5          NaN     0.150609                 3
   2 │ opt_knn        10          NaN     0.151021                 4
   3 │ opt_knn        15          NaN     0.139933                 1
   4 │ opt_tree  missing          NaN     0.147014                 2

Best Params:
  method => opt_knn
  knn_k => 15

Best Model - Fitted OptKNNImputationLearner
Note that for imputation comparisons, a lower validation score is better.
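To make the ranking concrete, the following sketch recomputes the rank_valid_score column from the validation scores in the results table, using plain Julia without IAI. The variable names are illustrative only.

```julia
# Validation scores copied from the grid results above, in row order
valid_scores = [0.150609, 0.151021, 0.139933, 0.147014]

# Lower is better, so rank 1 goes to the smallest score
ranks = sortperm(sortperm(valid_scores))

# Row index of the winning combination
best_row = argmin(valid_scores)

println(ranks)     # [3, 4, 1, 2]
println(best_row)  # 3
```

Row 3 (opt_knn with knn_k=15) has the lowest validation score, which matches the Best Params reported by the grid search.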
As with other uses of GridSearch, you can use fit_cv! in place of fit! to conduct cross-validation. You can also use fit_transform_cv! to combine the fit_cv! and transform steps.
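As a rough sketch of what this looks like, assuming the same grid and df as in the example above, and assuming these functions take the same arguments as the fit! call shown earlier (any keyword arguments, such as the number of folds, are left at their defaults here):

```julia
# Assumes `grid` and `df` are defined as in the earlier example,
# and that IAI is loaded with a valid license.

# Cross-validated alternative to IAI.fit!(grid, df)
IAI.fit_cv!(grid, df)

# Or: run the cross-validated grid search and impute df in one step
df_imputed = IAI.fit_transform_cv!(grid, df)
```

The second call is an alternative to the first rather than a follow-up to it, since it already includes the cross-validated fit.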
Which parameters to tune?
In most cases, there are very few parameters that need to be tuned for imputation methods. We recommend simply tuning over the different imputation methods and tuning knn_k and svm_gamma where appropriate, as shown in the example above.
The only other decision that needs to be made in most cases is whether to use the clustering approach. If the imputation is taking too long, try enabling this option and compare the imputation time with and without it.