When trying to select the right imputation model for a dataset, there are two considerations: which method to use for imputation and how to tune its parameters.
We recommend using GridSearch to select both the method and its parameters. The easiest way to do this is to validate over multiple parameter grids, one for each method under consideration, each containing the parameters to tune for that method. The following example constructs a grid search with multiple parameter grids to select between two optimal imputation methods, while tuning the parameter knn_k for the method that uses it:
grid = IAI.GridSearch(
    IAI.ImputationLearner(
        random_seed=1,
    ),
    [
        (method=:opt_knn, knn_k=[5, 10, 15]),
        (method=:opt_tree,),
    ],
)
GridSearch - Unfitted OptKNNImputationLearner:
  random_seed: 1

GridSearch Params:
  (method=opt_knn,knn_k=5,)
  (method=opt_knn,knn_k=10,)
  (method=opt_knn,knn_k=15,)
  (method=opt_tree,)
We can see that the parameter grids have been expanded into four options for the grid search to select between. We can now run the grid search to find the best method and parameter combination:
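A minimal sketch of that call, assuming the data to impute is held in a DataFrame named X that contains missing values (the name X is an assumption, not something defined in this section):

```julia
# X is assumed to be a DataFrame with missing entries, defined elsewhere.
# fit! evaluates every candidate in the grid and keeps the best one.
IAI.fit!(grid, X)
```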
All Grid Results:

 Row │ method    knn_k    train_score  valid_score  rank_valid_score
     │ Symbol    Int64?   Float64      Float64      Int64
─────┼───────────────────────────────────────────────────────────────
   1 │ opt_knn         5          NaN     0.150609                 3
   2 │ opt_knn        10          NaN     0.151021                 4
   3 │ opt_knn        15          NaN     0.139933                 1
   4 │ opt_tree  missing          NaN     0.147014                 2

Best Params:
  method => opt_knn
  knn_k => 15

Best Model - Fitted OptKNNImputationLearner
Note that for imputation comparisons, a lower validation score is better.
As with other uses of GridSearch, you can use fit_cv! in place of fit! in order to conduct cross-validation. You can also use fit_transform_cv! to combine the cross-validated fitting and the imputation of the data into a single step.
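As a hedged sketch of these two variants, assuming the same grid as above and a DataFrame X with missing values: the names X and X_imputed and the fold count of 5 are illustrative assumptions, and n_folds is assumed to be the keyword that sets the number of folds.

```julia
# Cross-validate over the grid instead of using a single validation split.
# X is assumed to be a DataFrame with missing entries, defined elsewhere.
IAI.fit_cv!(grid, X, n_folds=5)

# Alternatively, cross-validate and impute X in one call.
# X_imputed is a hypothetical name for the completed data that is returned.
X_imputed = IAI.fit_transform_cv!(grid, X, n_folds=5)
```

Only one of the two calls is needed in practice; they are shown together for comparison.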
Which parameters to tune?
In most cases, very few parameters need to be tuned for imputation methods. We recommend simply tuning over the different imputation methods and tuning knn_k where appropriate, as shown in the example above.
The only other decision that needs to be made in most cases is whether to use the clustering approach. If the imputation is taking too long, you might like to enable this option to see the effect on imputation time.
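One way to see the effect is to time the fit with the option off and on. The sketch below assumes that the learner exposes the clustering approach as a keyword named cluster and that X is a DataFrame with missing values. Both are assumptions to check against the learner's parameter documentation.

```julia
# Time imputation with the clustering approach disabled and enabled.
# `cluster` is an assumed option name; X is assumed to be defined elsewhere.
for use_cluster in (false, true)
    lnr = IAI.ImputationLearner(
        method=:opt_knn,
        cluster=use_cluster,
        random_seed=1,
    )
    # @elapsed returns the wall-clock time of the expression in seconds.
    seconds = @elapsed IAI.fit!(lnr, X)
    println("cluster=", use_cluster, ": ", round(seconds, digits=2), " s")
end
```

The first iteration also includes Julia's compilation time, so running the loop twice and comparing the second set of timings gives a fairer picture.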