OptImpute Learners

OptImpute provides many learners for training imputation models, which we describe on this page along with a guide to their parameters.

Shared Parameters

All of the learners provided by OptImpute are ImputationLearners. In addition to the shared learner parameters, these learners support the following parameters to control the behavior of the OptImpute algorithms.

cluster

cluster is a Bool indicating whether the imputation should be performed in clusters. The default value is false. When set to true, the data is split into clusters and the imputation is carried out within each cluster. Each imputation problem is therefore smaller, which typically makes the imputation faster without sacrificing overall imputation quality. The imputation within each cluster is automatically parallelized across any additional workers (see parallelism in IAIBase), which can speed up the process even further.

cluster_max_size

cluster_max_size is an Integer specifying the maximum size of the clusters to use when cluster is set to true. The default value is 500 and should not need to be changed.

max_iter

max_iter is an Integer specifying the maximum number of iterations of optimal imputation to carry out. The default value is 200 and should not need to be changed.

tol

tol is a Real specifying the stopping tolerance for the optimal imputation process. The default value is 1e-4 and should not need to be changed.
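
As a rough illustration of how an iteration limit such as max_iter and a stopping tolerance such as tol interact, the following Python sketch runs an iterative procedure until either limit is reached. It is not the OptImpute implementation: this page does not state which quantity the tolerance is compared against, so the absolute change in an objective value is assumed here, and iterate_until_converged, step, and objective are hypothetical names.

```python
def iterate_until_converged(step, objective, state, max_iter=200, tol=1e-4):
    """Apply `step` repeatedly until the objective settles or max_iter is hit."""
    previous = objective(state)
    for iteration in range(1, max_iter + 1):
        state = step(state)
        current = objective(state)
        # Assumed stopping rule: absolute change in the objective below tol.
        if abs(previous - current) < tol:
            return state, iteration
        previous = current
    # The tolerance was never met, so the iteration limit ends the run.
    return state, max_iter


# Toy problem: each step halves the distance to 3.0.
final, n_iter = iterate_until_converged(
    step=lambda x: x + 0.5 * (3.0 - x),
    objective=lambda x: (x - 3.0) ** 2,
    state=0.0,
)
print(final, n_iter)
```

In this toy run the tolerance ends the loop long before the limit of 200 iterations, which is the usual situation when the defaults are left unchanged.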

warmstart_use_mean

warmstart_use_mean is a Bool specifying whether to use a mean imputation as one of the warmstarts in the optimal imputation process. The default value is true and does not need to be changed.

warmstart_use_knn

warmstart_use_knn is a Bool specifying whether to use a single k-NN imputation as one of the warmstarts in the optimal imputation process. The default value is true and does not need to be changed.

warmstart_num_random_starts

warmstart_num_random_starts is an Integer specifying how many random imputations to use as warmstarts in the optimal imputation process. The default value is 5 and does not need to be changed.

Optimal k-NN Imputation

The OptKNNImputationLearner is used for optimal k-NN-based imputation. In addition to the shared parameters, it uses the following parameters:

knn_k

knn_k is an Integer specifying the number of neighbors to use in the k-NN calculations. The default value is 10, but tuning this value can lead to better imputations.

Optimal SVM Imputation

The OptSVMImputationLearner is used for optimal SVM-based imputation. In addition to the shared parameters, it uses the following parameters:

svm_c

svm_c is a Real controlling the amount of regularization in the SVM fitting. The default value is 1, but tuning this value can lead to better imputations.

svm_kernel

svm_kernel specifies the kernel to use during SVM-based imputation. The default value is :rbf, meaning the radial basis function will be used. This typically gives the highest-quality imputations. The other option is :linear, which will use a linear kernel. This approach can be faster, but the quality of the imputations may be lower.
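
To make the difference between the two kernels concrete, the following Python sketch evaluates the standard textbook definitions of a linear kernel and a radial basis function kernel. The gamma width parameter and the function names are assumptions for illustration only; this page does not document how OptImpute chooses the RBF width.

```python
import math


def linear_kernel(x, y):
    """Linear kernel: the dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


x, y = [1.0, 2.0], [2.0, 0.0]
print(linear_kernel(x, y))          # 2.0
print(rbf_kernel(x, y, gamma=0.5))  # exp(-2.5), roughly 0.082
```

The RBF kernel depends only on the distance between points, so it can capture nonlinear relationships, while the linear kernel is cheaper to evaluate.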

Optimal Tree Imputation

The OptTreeImputationLearner is used for optimal tree-based imputation. There are no additional parameters beyond the shared parameters.

Heuristic Imputation Methods

OptImpute also offers a number of heuristic imputation methods. Each of these learners supports the cluster and cluster_max_size shared parameters.

Random Imputation

The RandImputationLearner imputes each missing value by selecting a random value from the corresponding column in the data. There are no additional parameters.
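
The idea can be sketched in a few lines of Python. This is a conceptual illustration rather than the RandImputationLearner itself: missing entries are represented as None, the random value is assumed to be drawn from the observed (non-missing) entries of the column, and random_impute is a hypothetical name.

```python
import random


def random_impute(rows, rng=None):
    """Fill each None with a random observed value from the same column."""
    rng = rng or random.Random()
    n_cols = len(rows[0])
    # Collect the observed (non-missing) values of every column.
    # A column with no observed values would make rng.choice raise IndexError.
    observed = [
        [row[j] for row in rows if row[j] is not None] for j in range(n_cols)
    ]
    return [
        [rng.choice(observed[j]) if value is None else value
         for j, value in enumerate(row)]
        for row in rows
    ]


data = [[1.0, "a"], [None, "b"], [3.0, None]]
filled = random_impute(data, rng=random.Random(0))
print(filled)
```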

Mean Imputation

The MeanImputationLearner imputes missing values by replacing them with the mean value in the column for numeric values, and the mode of the column for categoric values. There are no additional parameters.
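
A minimal Python sketch of this rule follows. It is an illustration under simple assumptions, not the MeanImputationLearner itself: missing entries are represented as None, a column counts as numeric when all of its observed values are numbers, and mean_impute is a hypothetical name.

```python
from statistics import mean, mode


def mean_impute(rows):
    """Fill None with the column mean (numeric) or the column mode (otherwise)."""
    n_cols = len(rows[0])
    fill = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        numeric = all(isinstance(v, (int, float)) for v in observed)
        # One fill value per column: mean for numbers, mode for categories.
        fill.append(mean(observed) if numeric else mode(observed))
    return [[fill[j] if v is None else v for j, v in enumerate(row)] for row in rows]


data = [[1.0, "a"], [None, "b"], [3.0, None], [5.0, "b"]]
print(mean_impute(data))
# [[1.0, 'a'], [3.0, 'b'], [3.0, 'b'], [5.0, 'b']]
```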

Single k-NN Imputation

The SingleKNNImputationLearner imputes each missing value by identifying the k nearest neighbors (based on non-missing values) and taking the mean (for numeric) or mode (for categoric) of the neighbors' values in the corresponding column. The knn_k parameter is supported to allow specifying the number of neighbors to consider.
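
The procedure can be sketched in Python as follows. This is a simplified illustration, not the SingleKNNImputationLearner itself: missing entries are represented as None, distances are assumed to be Euclidean over the numeric columns observed in both rows (categoric columns are ignored when measuring distance), and the names distance and knn_impute are hypothetical.

```python
import math
from statistics import mean, mode


def distance(a, b):
    """Root mean squared difference over numeric columns observed in both rows."""
    shared = [
        (x, y) for x, y in zip(a, b)
        if isinstance(x, (int, float)) and isinstance(y, (int, float))
    ]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, knn_k=10):
    """Fill each None from the knn_k nearest rows that have that column observed."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows where column j is observed.
            candidates = [r for k, r in enumerate(rows) if k != i and r[j] is not None]
            candidates.sort(key=lambda r: distance(row, r))
            values = [r[j] for r in candidates[:knn_k]]
            numeric = all(isinstance(v, (int, float)) for v in values)
            filled[i][j] = mean(values) if numeric else mode(values)
    return filled


data = [
    [1.0, 1.0, "x"],
    [1.1, None, "x"],
    [5.0, 5.0, "y"],
    [5.2, 5.4, None],
    [0.9, 1.5, "x"],
    [4.8, 5.1, "y"],
]
filled = knn_impute(data, knn_k=2)
print(filled[1][1], filled[3][2])  # 1.25 y
```

With knn_k=2, the missing numeric value is the mean of its two closest rows (1.0 and 1.5), and the missing category is the mode of its two closest rows (both "y").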

Generic Learner

It is possible to construct a learner for a given imputation method using ImputationLearner with the method keyword argument. The methods that can be specified are:

method       Learner
:opt_knn     OptKNNImputationLearner
:opt_svm     OptSVMImputationLearner
:opt_tree    OptTreeImputationLearner
:rand        RandImputationLearner
:mean        MeanImputationLearner
:knn         SingleKNNImputationLearner

This is especially useful when using a GridSearch to choose the imputation method, as you can pass a vector of methods to choose from during validation.