OptImpute provides many learners for training imputation models, which we describe on this page along with a guide to their parameters.
All of the learners provided by OptImpute are ImputationLearners. In addition to the shared learner parameters, these learners support the following parameters to control the behavior of the OptImpute algorithms.
cluster is a Bool indicating whether the imputation should be performed in clusters. The default value is false. When set to true, the data is split into clusters and the imputation is carried out within each cluster. Each imputation problem is therefore smaller, which typically results in a faster imputation without sacrificing overall imputation quality. The imputation in each cluster is automatically parallelized across any additional workers (see parallelism in IAIBase), which can speed up the imputation process even further.
cluster_max_size is an Integer specifying the maximum size of the clusters to use when cluster is set to true. The default value is 500 and should not need to be changed.
max_iter is an Integer specifying the maximum number of iterations of optimal imputation to carry out. The default value is 200 and should not need to be changed.
tol is a Real specifying the stopping tolerance for the optimal imputation process. The default value is 1e-4 and should not need to be changed.
warmstart_use_mean is a Bool specifying whether to use a mean imputation as one of the warmstarts in the optimal imputation process. The default value is true and should not need to be changed.
warmstart_use_knn is a Bool specifying whether to use a single k-NN imputation as one of the warmstarts in the optimal imputation process. The default value is true and should not need to be changed.
warmstart_num_random_starts is an Integer specifying how many random imputations to use as warmstarts in the optimal imputation process. The default value is 5 and should not need to be changed.
Optimal k-NN Imputation
OptKNNImputationLearner is used for optimal k-NN-based imputation. In addition to the shared parameters, it uses the following parameters:
knn_k is an Integer specifying the number of neighbors to use in the k-NN calculations. The default value is 10, but tuning this value can lead to better imputations.
Optimal SVM Imputation
OptSVMImputationLearner is used for optimal SVM-based imputation. In addition to the shared parameters, it uses the following parameters:
svm_c is a Real controlling the amount of regularization in the SVM fitting. The default value is 1, but tuning this value can lead to better imputations.
svm_kernel specifies the kernel to use during SVM-based imputation. The default value is :rbf, meaning a radial basis function kernel will be used; this typically gives the highest-quality imputations. The other option is :linear, which uses a linear kernel. This approach can be faster, but the quality of the imputations may be lower.
Optimal Tree Imputation
OptTreeImputationLearner is used for optimal tree-based imputation. There are no additional parameters beyond the shared parameters.
Heuristic Imputation Methods
OptImpute also offers a number of heuristic imputation methods. Each of these learners supports the cluster_max_size shared parameter.
RandImputationLearner imputes each missing value by selecting a random value from the corresponding column in the data. There are no additional parameters.
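The idea behind random imputation can be sketched in a few lines of plain Python. This is an illustration of the technique only, not OptImpute's implementation: the function name rand_impute, the use of None to mark missing values, and the choice to sample only from the observed (non-missing) entries of each column are all assumptions made for the example.

```python
import random


def rand_impute(rows, seed=None):
    """Fill each missing entry (None) with a random observed value from its column.

    Hypothetical helper for illustration; not part of OptImpute.
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])
    # Collect the observed values of every column once.
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [
        [
            # A column with no observed values cannot be imputed and stays None.
            rng.choice(observed[j]) if v is None and observed[j] else v
            for j, v in enumerate(r)
        ]
        for r in rows
    ]


data = [[1.0, "a"], [None, "b"], [3.0, None]]
filled = rand_impute(data, seed=0)
```

Passing a seed makes the draws reproducible, which is convenient when comparing imputation methods on the same data.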
MeanImputationLearner imputes missing values by replacing them with the mean value in the column for numeric values, and the mode of the column for categoric values. There are no additional parameters.
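As a minimal sketch of the rule described above, the following plain-Python function fills numeric columns with the column mean and all other columns with the column mode. It is not OptImpute's code: the name mean_impute, the None marker for missing values, and the way numeric columns are detected are assumptions for the example.

```python
from collections import Counter
from statistics import mean


def mean_impute(rows):
    """Fill missing entries (None) with the column mean or the column mode.

    Hypothetical helper for illustration; not part of OptImpute.
    """
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        if not observed:
            # Nothing observed in this column, so there is nothing to fill with.
            fills.append(None)
        elif all(isinstance(v, (int, float)) for v in observed):
            # Numeric column: use the mean of the observed values.
            fills.append(mean(observed))
        else:
            # Categoric column: use the most frequent observed value.
            fills.append(Counter(observed).most_common(1)[0][0])
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]


data = [[1.0, "a"], [None, "a"], [3.0, None], [5.0, "b"]]
filled = mean_impute(data)
```

When two categories are tied for the mode, this sketch picks whichever appears first in the column; a real implementation may break ties differently.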
Single k-NN Imputation
SingleKNNImputationLearner imputes each missing value by identifying the k nearest neighbors (based on non-missing values) and taking the mean (for numeric) or mode (for categoric) of the neighbors' values in the corresponding column. The knn_k parameter is supported to allow specifying the number of neighbors to consider.
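The following plain-Python sketch illustrates a single pass of k-NN imputation. It is not OptImpute's implementation, and several details are assumptions made for the example: missing values are marked with None, the distance between two rows is the root mean squared difference over the numeric columns observed in both rows (categoric columns are ignored in the distance), and only rows with an observed value in the target column are eligible as neighbors. The argument name knn_k mirrors the learner parameter; the function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean


def _is_num(v):
    return isinstance(v, (int, float))


def _distance(a, b):
    """Root mean squared difference over numeric columns observed in both rows."""
    diffs = [(x - y) ** 2 for x, y in zip(a, b) if _is_num(x) and _is_num(y)]
    return math.sqrt(sum(diffs) / len(diffs)) if diffs else math.inf


def knn_impute(rows, knn_k=10):
    """Fill each missing entry from its knn_k nearest rows (illustrative only)."""
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            # Rank the other rows that have column j observed by distance to this row.
            ranked = sorted(
                (_distance(row, other), idx)
                for idx, other in enumerate(rows)
                if idx != i and other[j] is not None
            )
            values = [rows[idx][j] for _, idx in ranked[:knn_k]]
            if not values:
                continue  # no usable neighbor, so leave the entry missing
            if all(_is_num(x) for x in values):
                out[i][j] = mean(values)  # numeric column: neighbor mean
            else:
                out[i][j] = Counter(values).most_common(1)[0][0]  # neighbor mode
    return out


data = [
    [1.0, 1.0, "x"],
    [1.1, None, "x"],
    [5.0, 9.0, "y"],
    [5.2, 9.4, None],
    [0.9, 1.2, "x"],
]
filled = knn_impute(data, knn_k=2)
```

Distances are computed from the original rows rather than from partially filled ones, so the result does not depend on the order in which entries are imputed.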
It is possible to construct a learner for a given imputation method using ImputationLearner with the method keyword argument. The methods that can be specified are:
This is especially useful when using a GridSearch to choose the imputation method, as you can pass a vector of methods to choose from during validation.