Parameter Tuning

When trying to select the right imputation model for a dataset, there are two considerations: which method to use for imputation and how to tune its parameters.

We recommend using a GridSearch to select both the method and its parameters. The easiest way to do this is to validate over multiple parameter grids, one for each method under consideration, each containing the parameters to tune for that method. The following example constructs a grid search with multiple parameter grids to select between the three optimal imputation methods, while tuning the parameters knn_k and svm_gamma for the appropriate methods:

grid = IAI.GridSearch(
    (method=:opt_knn, knn_k=[5, 10, 15]),

GridSearch - Unfitted OptKNNImputationLearner:
  random_seed: 1

GridSearch Params:

We can see that the parameter grids have been expanded into four options for the grid search to select between. We can now run the grid search to find the best method and parameter combination:

IAI.fit!(grid, df)
All Grid Results:

 Row │ method    knn_k    train_score  valid_score  rank_valid_score
     │ Symbol    Int64?   Float64      Float64      Int64
─────┼──────────────────────────────────────────────────────────────
   1 │ opt_knn         5          NaN     0.150609                 3
   2 │ opt_knn        10          NaN     0.151021                 4
   3 │ opt_knn        15          NaN     0.139933                 1
   4 │ opt_tree  missing          NaN     0.147014                 2

Best Params:
  method => opt_knn
  knn_k => 15

Best Model - Fitted OptKNNImputationLearner

Note that for imputation comparisons, a lower validation score is better.
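
As a rough illustration of how the ranking in the table follows from the validation scores, the sketch below recomputes rank_valid_score in plain Julia from the four rows shown above. It does not call IAI, and the variable names (candidates, valid_scores, ranks, best) are chosen here for illustration only.

```julia
# Candidate settings and validation scores copied from the results table above.
candidates = [
    (method=:opt_knn, knn_k=5),
    (method=:opt_knn, knn_k=10),
    (method=:opt_knn, knn_k=15),
    (method=:opt_tree, knn_k=missing),
]
valid_scores = [0.150609, 0.151021, 0.139933, 0.147014]

# Lower is better for imputation, so rank 1 goes to the smallest score.
order = sortperm(valid_scores)   # candidate indices, best first
ranks = invperm(order)           # rank assigned to each candidate

# The selected combination is the one with the lowest validation score.
best = candidates[argmin(valid_scores)]

println(ranks)
println(best)
```

The recomputed ranks agree with the rank_valid_score column, and the lowest-scoring candidate is the opt_knn method with knn_k set to 15, matching the Best Params output.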

As with other uses of GridSearch, you can use fit_cv! in place of fit! in order to conduct cross-validation. You can also use fit_transform_cv! to combine the fit_cv! and transform steps.
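
A minimal sketch of the cross-validated workflow might look like the following. It rests on several assumptions that are not confirmed by the text above: that the base learner is IAI.ImputationLearner(random_seed=1), that multiple parameter grids are passed as a vector of named tuples, that fit_cv! and fit_transform_cv! accept an n_folds keyword, and that transform can be called directly on a fitted grid. The data frame is synthetic and exists only to make the example self-contained.

```julia
# Assumes the IAI module is already available in the session and that
# DataFrames is installed. The data below is synthetic, for illustration only.
using DataFrames, Random

Random.seed!(1)
df = DataFrame(rand(200, 4), :auto)
allowmissing!(df)
# Blank out every tenth row of each column, offset per column so that
# no row has more than one missing value.
for (j, col) in enumerate(names(df))
    df[j:10:200, col] .= missing
end

# Assumed constructor form: a base learner plus one grid per method.
grid = IAI.GridSearch(
    IAI.ImputationLearner(random_seed=1),
    [
        (method=:opt_knn, knn_k=[5, 10, 15]),
        (method=:opt_tree,),
    ],
)

# Option 1: cross-validate to pick the best combination, then impute.
IAI.fit_cv!(grid, df, n_folds=5)   # n_folds is an assumed keyword
df_imputed = IAI.transform(grid, df)

# Option 2: the same two steps in a single call.
df_imputed = IAI.fit_transform_cv!(grid, df, n_folds=5)
```

The two grids used here mirror the four candidates in the results table above. Check the exact keyword names against the API reference for your installed version.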

Which parameters to tune?

In most cases, there are very few parameters that need to be tuned for imputation methods. We recommend simply tuning over the different imputation methods and tuning knn_k and svm_gamma where appropriate, as shown in the example above.

The only other decision that needs to be made in most cases is whether to use the clustering approach. If the imputation is taking too long, you might like to enable this option to see the effect on imputation time.
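
One way to gauge the effect is to time the same imputation with and without clustering. The sketch below assumes the option is exposed as a cluster keyword on the learner and that fit_transform! is available for imputation learners. Neither detail is stated in the text above, so confirm both against the API reference. The data is synthetic.

```julia
# Assumes the IAI module is already available in the session and that
# DataFrames is installed. The data below is synthetic, for illustration only.
using DataFrames, Random

Random.seed!(1)
df = DataFrame(rand(200, 4), :auto)
allowmissing!(df)
for (j, col) in enumerate(names(df))
    df[j:10:200, col] .= missing
end

# `cluster=true` is an assumed parameter name for the clustering approach.
lnr_plain = IAI.ImputationLearner(method=:opt_knn, random_seed=1)
lnr_clustered = IAI.ImputationLearner(method=:opt_knn, cluster=true,
                                      random_seed=1)

# @elapsed returns wall-clock seconds. The first call also includes
# compilation time, so repeat the measurement before drawing conclusions.
t_plain = @elapsed IAI.fit_transform!(lnr_plain, df)
t_clustered = @elapsed IAI.fit_transform!(lnr_clustered, df)

println("without clustering: ", t_plain, " s")
println("with clustering:    ", t_clustered, " s")
```

On a dataset this small the difference is unlikely to be meaningful. The comparison is more informative on the data where imputation is actually slow.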