OptImpute Learners

OptImpute provides many learners for training imputation models. This page describes each learner and gives a guide to its parameters.

Shared Parameters

All of the learners provided by OptImpute are ImputationLearners. In addition to the shared learner parameters, these learners support the following parameters to control the behavior of the OptImpute algorithms.

cluster

cluster is a Bool indicating whether the imputation should be performed in clusters. The default value is false. When set to true, the data is split into clusters and then the imputation is carried out within each cluster. This means each imputation problem is smaller in size, and typically results in a faster imputation without sacrificing the overall imputation quality. The imputation in each cluster is automatically parallelized across any additional workers (see parallelism in IAIBase) which can speed the imputation process up even further.

cluster_max_size

cluster_max_size is an Integer specifying the maximum size of the clusters to use when cluster is set to true. The default value is 500 and should not need to be changed.

max_iter

max_iter is an Integer specifying the maximum number of iterations of optimal imputation to carry out. The default value is 200 and should not need to be changed.

tol

tol is a Real specifying the stopping tolerance for the optimal imputation process. The default value is 1e-4 and should not need to be changed.
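
To make the interaction between max_iter and tol concrete, the following is a minimal Python sketch of a generic iterative loop with both stopping rules. It is not OptImpute code: the helper iterate_until_converged, the step and objective callables, and the assumption that tol is compared against the change in objective between successive iterations are all illustrative choices, since this page does not state what quantity tol measures.

```python
def iterate_until_converged(step, state, objective, max_iter=200, tol=1e-4):
    """Apply `step` repeatedly; stop when the objective changes by less than
    `tol` between iterations (an assumed criterion) or after `max_iter` steps."""
    previous = objective(state)
    for iteration in range(1, max_iter + 1):
        state = step(state)
        current = objective(state)
        if abs(previous - current) < tol:
            return state, iteration  # converged before hitting the cap
        previous = current
    return state, max_iter  # iteration cap reached without converging


# Toy problem: halve x each step and track x squared as the objective.
final_x, iterations = iterate_until_converged(
    step=lambda x: x / 2,
    state=1.0,
    objective=lambda x: x * x,
)
print(final_x, iterations)
```

In this sketch a looser tol ends the loop earlier, while max_iter only matters when the tolerance is never met.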

warmstart_use_mean

warmstart_use_mean is a Bool specifying whether to use a mean imputation as one of the warmstarts when fitting a learner in the optimal imputation process. The default value is false and typically does not need to be changed.

warmstart_use_knn

warmstart_use_knn is a Bool specifying whether to use a single k-NN imputation as one of the warmstarts when fitting a learner in the optimal imputation process. The default value is true and typically does not need to be changed.

warmstart_num_random_starts

warmstart_num_random_starts is an Integer specifying how many random imputations to use as warmstarts when fitting a learner in the optimal imputation process. The default value is 5 and typically does not need to be changed.

transform_warmstart_use_mean

transform_warmstart_use_mean is a Bool specifying whether to use a mean imputation as one of the warmstarts during the transform step in the optimal imputation process. This parameter defaults to the value specified for warmstart_use_mean and typically does not need to be changed.

transform_warmstart_use_knn

transform_warmstart_use_knn is a Bool specifying whether to use a single k-NN imputation as one of the warmstarts during the transform step in the optimal imputation process. This parameter defaults to the value specified for warmstart_use_knn and typically does not need to be changed.

transform_warmstart_num_random_starts

transform_warmstart_num_random_starts is an Integer specifying how many random imputations to use as warmstarts during the transform step in the optimal imputation process. This parameter defaults to the value specified for warmstart_num_random_starts and typically does not need to be changed.

Optimal k-NN Imputation

The OptKNNImputationLearner is used for optimal k-NN-based imputation. In addition to the shared parameters, it uses the following parameters:

knn_k

knn_k is an Integer specifying the number of neighbors to use in the k-NN calculations. The default value is 10, but tuning this value can lead to better imputations.

Optimal SVM Imputation

The OptSVMImputationLearner is used for optimal SVM-based imputation. In addition to the shared parameters, it uses the following parameters:

svm_gamma

svm_gamma is a Real or Symbol controlling the amount of regularization in the SVM fitting. The default value is :auto, but tuning this value can lead to better imputations. Refer to the documentation on the gamma parameter for OptimalFeatureSelection for more information.

svm_epsilon

svm_epsilon is a Real specifying $\epsilon$ in the hinge loss used when fitting SVR models. The default value is 0.1, but tuning this value can lead to better imputations. Refer to the documentation on the hinge loss parameter for more information.

Optimal Tree Imputation

The OptTreeImputationLearner is used for optimal tree-based imputation. There are no additional parameters beyond the shared parameters.

Heuristic Imputation Methods

OptImpute also offers a number of heuristic imputation methods. Each of these learners supports the cluster and cluster_max_size shared parameters.

Random Imputation

The RandImputationLearner imputes each missing value by selecting a random value from the corresponding column in the data. There are no additional parameters.
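
As an illustration of the idea, here is a minimal Python sketch of random imputation on a single column. It is not the RandImputationLearner implementation: the function random_impute_column is hypothetical, missing values are represented as None, and drawing only from the observed (non-missing) entries of the column is an assumption.

```python
import random


def random_impute_column(values, rng):
    """Replace each None with a value drawn at random from the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to sample from; leave the column unchanged
    return [rng.choice(observed) if v is None else v for v in values]


rng = random.Random(0)  # seeded so that repeated runs give the same draws
column = [1.5, None, 3.0, None]
filled = random_impute_column(column, rng)
print(filled)
```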

Mean Imputation

The MeanImputationLearner imputes missing values by replacing them with the mean value in the column for numeric values, and the mode of the column for categoric values. There are no additional parameters.
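
The rule above can be sketched in a few lines of Python. This is an illustration only, not the MeanImputationLearner implementation: the function mean_impute_column is hypothetical, missing values are represented as None, and a column is treated as numeric when all of its observed entries are numbers.

```python
from collections import Counter
from statistics import mean


def mean_impute_column(values):
    """Fill None entries with the column mean (numeric) or mode (categoric)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn from; leave the column unchanged
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)
    else:
        # most_common orders equal counts by first occurrence
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


ages = [20, None, 40]
colors = ["red", "blue", None, "red"]
print(mean_impute_column(ages))
print(mean_impute_column(colors))
```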

Zero Imputation

The ZeroImputationLearner imputes missing values by replacing them with 0 for numeric features, and a new level "Null Level" for categoric features. There are no additional parameters.
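
For comparison with the other heuristics, a minimal Python sketch of this rule follows. It is an illustration rather than the ZeroImputationLearner implementation: the function zero_impute_column and its is_numeric flag are hypothetical, and missing values are represented as None.

```python
NULL_LEVEL = "Null Level"  # the new categoric level named in the text above


def zero_impute_column(values, is_numeric):
    """Fill None with 0 for numeric columns, or with a new level for categoric ones."""
    fill = 0 if is_numeric else NULL_LEVEL
    return [fill if v is None else v for v in values]


print(zero_impute_column([2.5, None, 4.0], is_numeric=True))
print(zero_impute_column(["a", None, "b"], is_numeric=False))
```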

Single k-NN Imputation

The SingleKNNImputationLearner imputes each missing value by identifying the k nearest neighbors (based on non-missing values) and taking the mean (for numeric) or mode (for categoric) of the neighbors' values in the corresponding column. The knn_k parameter is supported to allow specifying the number of neighbors to consider.
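
A minimal Python sketch of a single k-NN imputation pass for numeric data is shown below. It is not the SingleKNNImputationLearner implementation: the functions distance and knn_impute are hypothetical, missing values are represented as None, and the distance (Euclidean over the features observed in both rows, scaled by the number of shared features) is an assumed choice. The mode-based rule for categoric features is omitted for brevity.

```python
from math import sqrt


def distance(a, b):
    """Distance over features observed in both rows; infinite if none are shared."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k):
    """Fill each None with the mean of that column over the k nearest donor rows."""
    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which column j is observed.
            donors = [r for m, r in enumerate(rows) if m != i and r[j] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            if nearest:
                result[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return result


rows = [[1.0, 2.0], [1.25, None], [5.0, 10.0], [0.5, 1.0]]
filled_rows = knn_impute(rows, k=2)
print(filled_rows[1])
```

Here the second row is closest to the first and fourth rows on its observed feature, so its missing entry is filled with the mean of their values in that column.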

Generic Learner

It is possible to construct a learner for a given imputation method using ImputationLearner with the method keyword argument. The methods that can be specified are:

method      Learner
:opt_knn    OptKNNImputationLearner
:opt_svm    OptSVMImputationLearner
:opt_tree   OptTreeImputationLearner
:rand       RandImputationLearner
:zero       ZeroImputationLearner
:mean       MeanImputationLearner
:knn        SingleKNNImputationLearner

This is especially useful when using a GridSearch to choose the imputation method, as you can pass a vector of methods to choose from during validation.
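
The idea of choosing an imputation method by validation can be sketched in Python as follows. This does not use the GridSearch or ImputationLearner APIs: the METHODS registry, the choose_method function, and the squared-error score on deliberately hidden entries are all hypothetical stand-ins meant only to show the selection logic.

```python
from statistics import mean


def impute_mean(values):
    """Fill None with the mean of the observed entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_zero(values):
    """Fill None with zero."""
    return [0.0 if v is None else v for v in values]


# Hypothetical registry mapping method names to imputation functions.
METHODS = {"mean": impute_mean, "zero": impute_zero}


def choose_method(values, holdout_indices, candidates):
    """Hide known entries, impute with each candidate, and return the best one."""
    masked = [None if i in holdout_indices else v for i, v in enumerate(values)]
    errors = {}
    for name in candidates:
        filled = METHODS[name](masked)
        # Squared error on the entries that were hidden on purpose.
        errors[name] = sum((filled[i] - values[i]) ** 2 for i in holdout_indices)
    return min(errors, key=errors.get), errors


column = [10.0, 12.0, 11.0, 9.0, 13.0]
best, errors = choose_method(column, holdout_indices={1, 3}, candidates=["mean", "zero"])
print(best, errors)
```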