OptImpute Learners
OptImpute provides many learners for training imputation models, which we describe on this page along with a guide to their parameters.
Shared Parameters
All of the learners provided by OptImpute are ImputationLearners. In addition to the shared learner parameters, these learners support the following parameters to control the behavior of the OptImpute algorithms.
cluster
cluster is a Bool indicating whether the imputation should be performed in clusters. The default value is false. When set to true, the data is split into clusters and the imputation is carried out within each cluster. Each imputation problem is therefore smaller, which typically results in a faster imputation without sacrificing the overall imputation quality. The imputation in each cluster is automatically parallelized across any additional workers (see parallelism in IAIBase), which can speed up the imputation process even further.
cluster_max_size
cluster_max_size is an Integer specifying the maximum size of the clusters to use when cluster is set to true. The default value is 500 and should not need to be changed.
max_iter
max_iter is an Integer specifying the maximum number of iterations of optimal imputation to carry out. The default value is 200 and should not need to be changed.
tol
tol is a Real specifying the stopping tolerance for the optimal imputation process. The default value is 1e-4 and should not need to be changed.
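To illustrate how an iteration cap and a stopping tolerance typically interact, here is a minimal plain-Python sketch. It is not OptImpute's algorithm: the function `iterate_until_converged`, the `step` callback, and the convergence measure (absolute change in an objective value) are all assumptions made for illustration. Only the default values 200 and 1e-4 come from the parameter descriptions above.

```python
def iterate_until_converged(step, state, max_iter=200, tol=1e-4):
    """Repeat `step` until the objective changes by less than `tol`
    or `max_iter` iterations have been carried out.

    `step` is a hypothetical callback that takes the current state and
    returns (new_state, objective_value).
    """
    previous = float("inf")
    for iteration in range(1, max_iter + 1):
        state, objective = step(state)
        if abs(previous - objective) < tol:
            return state, iteration  # converged within tolerance
        previous = objective
    return state, max_iter  # iteration budget exhausted


# Toy step: halve the state each time; the objective is the state itself.
def halve(x):
    new_x = x / 2
    return new_x, new_x


final_state, iterations_used = iterate_until_converged(halve, 1.0)
capped_state, capped_iterations = iterate_until_converged(halve, 1.0, max_iter=3)
```

In this toy run the loop stops on the tolerance long before the cap of 200 is reached, while the second call stops on the cap. This is consistent with the advice that the defaults rarely need changing.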
warmstart_use_mean
warmstart_use_mean is a Bool specifying whether to use a mean imputation as one of the warmstarts when fitting a learner in the optimal imputation process. The default value is false and typically does not need to be changed.
warmstart_use_knn
warmstart_use_knn is a Bool specifying whether to use a single k-NN imputation as one of the warmstarts when fitting a learner in the optimal imputation process. The default value is true and typically does not need to be changed.
warmstart_num_random_starts
warmstart_num_random_starts is an Integer specifying how many random imputations to use as warmstarts when fitting a learner in the optimal imputation process. The default value is 5 and typically does not need to be changed.
transform_warmstart_use_mean
transform_warmstart_use_mean is a Bool specifying whether to use a mean imputation as one of the warmstarts during the transform step in the optimal imputation process. This parameter defaults to the value specified for warmstart_use_mean and typically does not need to be changed.
transform_warmstart_use_knn
transform_warmstart_use_knn is a Bool specifying whether to use a single k-NN imputation as one of the warmstarts during the transform step in the optimal imputation process. This parameter defaults to the value specified for warmstart_use_knn and typically does not need to be changed.
transform_warmstart_num_random_starts
transform_warmstart_num_random_starts is an Integer specifying how many random imputations to use as warmstarts during the transform step in the optimal imputation process. This parameter defaults to the value specified for warmstart_num_random_starts and typically does not need to be changed.
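The way the transform_warmstart_* parameters fall back to their fit-time counterparts can be sketched in plain Python. This is an illustration only, not the OptImpute implementation: the helper names `warmstart_plan` and `resolve_plans`, the use of `None` to mean "not specified", and the string labels for each warmstart are assumptions. The default values are those listed above.

```python
def warmstart_plan(use_mean, use_knn, num_random_starts):
    """List the starting imputations implied by the three warmstart settings."""
    plan = []
    if use_mean:
        plan.append("mean")
    if use_knn:
        plan.append("knn")
    plan.extend(f"random_{i}" for i in range(1, num_random_starts + 1))
    return plan


def resolve_plans(
    warmstart_use_mean=False,
    warmstart_use_knn=True,
    warmstart_num_random_starts=5,
    transform_warmstart_use_mean=None,
    transform_warmstart_use_knn=None,
    transform_warmstart_num_random_starts=None,
):
    """Return (fit_plan, transform_plan); None means "inherit the fit-time value"."""

    def inherit(value, fallback):
        return fallback if value is None else value

    fit_plan = warmstart_plan(
        warmstart_use_mean, warmstart_use_knn, warmstart_num_random_starts
    )
    transform_plan = warmstart_plan(
        inherit(transform_warmstart_use_mean, warmstart_use_mean),
        inherit(transform_warmstart_use_knn, warmstart_use_knn),
        inherit(transform_warmstart_num_random_starts, warmstart_num_random_starts),
    )
    return fit_plan, transform_plan


# Only the number of random starts is overridden for the transform step.
fit_plan, transform_plan = resolve_plans(transform_warmstart_num_random_starts=1)
```

Here the transform step inherits the k-NN and mean settings from the fit step and differs only in the number of random starts.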
Optimal k-NN Imputation
The OptKNNImputationLearner is used for optimal k-NN-based imputation. In addition to the shared parameters, it uses the following parameters:
knn_k
knn_k is an Integer specifying the number of neighbors to use in the k-NN calculations. The default value is 10, but tuning this value can lead to better imputations.
Optimal SVM Imputation
The OptSVMImputationLearner is used for optimal SVM-based imputation. In addition to the shared parameters, it uses the following parameters:
svm_gamma
svm_gamma is a Real or Symbol controlling the amount of regularization in the SVM fitting. The default value is :auto, but tuning this value can lead to better imputations. Refer to the documentation on the gamma parameter for OptimalFeatureSelection for more information.
svm_epsilon
svm_epsilon is a Real specifying $\epsilon$ in the hinge loss used when fitting SVR models. The default value is 0.1, but tuning this value can lead to better imputations. Refer to the documentation on the hinge loss parameter for more information.
Optimal Tree Imputation
The OptTreeImputationLearner is used for optimal tree-based imputation. There are no additional parameters beyond the shared parameters.
Heuristic Imputation Methods
OptImpute also offers a number of heuristic imputation methods. Each of these learners supports the cluster and cluster_max_size shared parameters.
Random Imputation
The RandImputationLearner imputes each missing value by selecting a random value from the corresponding column in the data. There are no additional parameters.
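The idea behind random imputation can be sketched for a single column in plain Python. This is not the RandImputationLearner implementation: the function name `rand_impute_column`, the use of `None` to mark missing entries, and the choice to draw only from the observed values of the column are assumptions made for illustration.

```python
import random


def rand_impute_column(column, rng):
    """Replace each None with a value drawn at random from the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing to draw from, so leave the column as-is
    return [rng.choice(observed) if v is None else v for v in column]


rng = random.Random(0)  # seeded so that repeated runs give the same draws
filled = rand_impute_column([1.0, None, 3.0, None], rng)
```

Observed entries are left untouched, and every imputed entry is a value that already appears in the column.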
Mean Imputation
The MeanImputationLearner imputes missing values by replacing them with the mean value in the column for numeric values, and the mode of the column for categoric values. There are no additional parameters.
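As a rough illustration of the mean/mode rule, the following plain-Python sketch fills a single column. It is not the MeanImputationLearner implementation: the function name `mean_impute_column`, the use of `None` for missing entries, and the way a column is classified as numeric are assumptions.

```python
from statistics import mean, mode


def mean_impute_column(column):
    """Fill None with the column mean (numeric) or the column mode (categoric)."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # no observed values to summarize
    # Assumption: a column counts as numeric if every observed value is an int or float.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    fill = mean(observed) if is_numeric else mode(observed)
    return [fill if v is None else v for v in column]


numeric_filled = mean_impute_column([1.0, None, 3.0])          # [1.0, 2.0, 3.0]
categoric_filled = mean_impute_column(["a", "b", None, "a"])   # ["a", "b", "a", "a"]
```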
Zero Imputation
The ZeroImputationLearner imputes missing values by replacing them with 0 for numeric features, and a new level "Null Level" for categoric features. There are no additional parameters.
Single k-NN Imputation
The SingleKNNImputationLearner imputes each missing value by identifying the k nearest neighbors (based on non-missing values) and taking the mean (for numeric) or mode (for categoric) of the neighbors' values in the corresponding column. The knn_k parameter is supported to allow specifying the number of neighbors to consider.
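A single pass of k-NN imputation can be sketched in plain Python for numeric data. This is only an illustration, not the SingleKNNImputationLearner implementation. The function name `knn_impute` is hypothetical, and several details are assumptions: missing entries are represented by `None`, the distance is the root-mean-square difference over features observed in both rows, and only rows with an observed value in the target column are considered as neighbors. The knn_k argument mirrors the parameter described above.

```python
import math
from statistics import mean


def knn_impute(rows, knn_k=10):
    """Fill each None with the mean of that column over the k nearest rows."""

    def distance(a, b):
        # Compare only the features that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no overlap, so treat the rows as infinitely far apart
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows where column j is observed.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:knn_k]]
            if nearest:
                imputed[i][j] = mean(nearest)
    return imputed


data = [
    [1.0, 10.0],
    [1.1, 12.0],
    [5.0, 50.0],
    [1.05, None],
]
result = knn_impute(data, knn_k=2)
```

With knn_k=2 the last row borrows from the first two rows, which are closest in the observed feature, and ignores the distant third row. For categoric columns the mean would be replaced by the mode.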
Generic Learner
It is possible to construct a learner for a given imputation method using ImputationLearner with the method keyword argument. The methods that can be specified are:
method | Learner |
---|---|
:opt_knn | OptKNNImputationLearner |
:opt_svm | OptSVMImputationLearner |
:opt_tree | OptTreeImputationLearner |
:rand | RandImputationLearner |
:zero | ZeroImputationLearner |
:mean | MeanImputationLearner |
:knn | SingleKNNImputationLearner |
This is especially useful when using a GridSearch to choose the imputation method, as you can pass a vector of methods to choose from during validation.
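To give a feel for what choosing an imputation method during validation can mean, here is a generic hold-out sketch in plain Python. It does not use GridSearch or any OptImpute API, and it makes no claim about how GridSearch validates internally. The names `mean_fill`, `zero_fill`, `CANDIDATES`, and `choose_method`, and the choice of mean absolute error as the score, are all assumptions made for illustration.

```python
from statistics import mean


def mean_fill(column):
    """Fill None with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def zero_fill(column):
    """Fill None with 0."""
    return [0 if v is None else v for v in column]


# Hypothetical candidate methods, keyed by name.
CANDIDATES = {"mean": mean_fill, "zero": zero_fill}


def choose_method(column, holdout_indices, candidates=CANDIDATES):
    """Hide some known values, impute them with each candidate, and
    return the candidate with the lowest mean absolute error."""
    masked = [None if i in holdout_indices else v for i, v in enumerate(column)]
    errors = {}
    for name, impute in candidates.items():
        filled = impute(masked)
        errors[name] = mean(abs(filled[i] - column[i]) for i in holdout_indices)
    return min(errors, key=errors.get), errors


column = [4.0, 5.0, 6.0, 5.0, None]
best, errors = choose_method(column, holdout_indices=[1, 3])
```

On this toy column the mean imputer recovers the hidden values exactly, so it is selected over zero imputation.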