# OptImpute Learners

OptImpute provides many learners for training imputation models, which we describe on this page along with a guide to their parameters.

## Shared Parameters

All of the learners provided by OptImpute are `ImputationLearner`s. In addition to the shared learner parameters, these learners support the following parameters to control the behavior of the OptImpute algorithms.

`cluster`

`cluster` is a `Bool` indicating whether the imputation should be performed in clusters. The default value is `false`. When set to `true`, the data is split into clusters and then the imputation is carried out within each cluster. This means each imputation problem is smaller in size, and typically results in a faster imputation without sacrificing the overall imputation quality. The imputation in each cluster is automatically parallelized across any additional workers (see parallelism in IAIBase) which can speed the imputation process up even further.

`cluster_max_size`

`cluster_max_size` is an `Integer` specifying the maximum size of the clusters to use when `cluster` is set to `true`. The default value is `500` and should not need to be changed.
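To illustrate what imputing within size-bounded clusters means, the following Python sketch splits the rows of a single column into groups of at most `cluster_max_size` rows and imputes each group independently. It is not OptImpute's clustering algorithm: plain sequential chunking stands in for the real clustering step, a column mean stands in for the imputation method, and `split_into_clusters`, `impute_column_mean`, and `impute_in_clusters` are hypothetical helper names.

```python
from statistics import mean


def split_into_clusters(n_rows, cluster_max_size=500):
    """Split row indices 0..n_rows-1 into consecutive groups of bounded size.

    Assumption: sequential chunking stands in for a real clustering step.
    """
    if cluster_max_size < 1:
        raise ValueError("cluster_max_size must be at least 1")
    return [
        list(range(start, min(start + cluster_max_size, n_rows)))
        for start in range(0, n_rows, cluster_max_size)
    ]


def impute_column_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if observed else None
    return [fill if v is None else v for v in values]


def impute_in_clusters(column, cluster_max_size=500):
    """Impute each cluster separately, so every subproblem stays small."""
    result = list(column)
    for cluster in split_into_clusters(len(column), cluster_max_size):
        imputed = impute_column_mean([column[i] for i in cluster])
        for i, value in zip(cluster, imputed):
            result[i] = value
    return result
```

Because each cluster is processed on its own, the per-cluster calls are independent of one another, which is what makes it possible to distribute them across workers.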

`max_iter`

`max_iter` is an `Integer` specifying the maximum number of iterations of optimal imputation to carry out. The default value is `200` and should not need to be changed.

`tol`

`tol` is a `Real` specifying the stopping tolerance for the optimal imputation process. The default value is `1e-4` and should not need to be changed.
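The interaction between `max_iter` and `tol` can be pictured with a generic iterative loop. The Python sketch below is an illustration only: it assumes the process stops once the absolute change in an objective value between consecutive iterations falls below `tol`, which this page does not confirm as the exact stopping rule. `run_until_converged` and the toy functions are hypothetical names.

```python
def run_until_converged(step, objective, state, max_iter=200, tol=1e-4):
    """Apply `step` repeatedly until the objective stops changing.

    Assumption: the run stops when the absolute change in the objective
    between consecutive iterations drops below `tol`, or after `max_iter`
    iterations, whichever comes first. Returns (state, iterations_used).
    """
    previous = objective(state)
    for iteration in range(1, max_iter + 1):
        state = step(state)
        current = objective(state)
        if abs(previous - current) < tol:
            return state, iteration
        previous = current
    return state, max_iter


def halve_gap(x):
    """Toy update: move halfway towards the target value 3.0."""
    return x + 0.5 * (3.0 - x)


def squared_gap(x):
    """Toy objective: squared distance from the target value 3.0."""
    return (x - 3.0) ** 2
```

For example, `run_until_converged(halve_gap, squared_gap, 0.0)` stops well before the iteration cap, whereas passing `max_iter=3` ends the run after three steps regardless of the tolerance.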

`warmstart_use_mean`

`warmstart_use_mean` is a `Bool` specifying whether to use a mean imputation as one of the warmstarts when fitting a learner in the optimal imputation process. The default value is `false` and typically does not need to be changed.

`warmstart_use_knn`

`warmstart_use_knn` is a `Bool` specifying whether to use a single k-NN imputation as one of the warmstarts when fitting a learner in the optimal imputation process. The default value is `true` and typically does not need to be changed.

`warmstart_num_random_starts`

`warmstart_num_random_starts` is an `Integer` specifying how many random imputations to use as warmstarts when fitting a learner in the optimal imputation process. The default value is `5` and typically does not need to be changed.

`transform_warmstart_use_mean`

`transform_warmstart_use_mean` is a `Bool` specifying whether to use a mean imputation as one of the warmstarts during the transform step in the optimal imputation process. This parameter defaults to the value specified for `warmstart_use_mean` and typically does not need to be changed.

`transform_warmstart_use_knn`

`transform_warmstart_use_knn` is a `Bool` specifying whether to use a single k-NN imputation as one of the warmstarts during the transform step in the optimal imputation process. This parameter defaults to the value specified for `warmstart_use_knn` and typically does not need to be changed.

`transform_warmstart_num_random_starts`

`transform_warmstart_num_random_starts` is an `Integer` specifying how many random imputations to use as warmstarts during the transform step in the optimal imputation process. This parameter defaults to the value specified for `warmstart_num_random_starts` and typically does not need to be changed.
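The fallback relationship between the fit-time warmstart parameters and their `transform_` counterparts can be summarized in a small Python sketch. This is only a model of the defaulting rule described above, not OptImpute code: `WarmstartSettings` and `resolved_transform_settings` are hypothetical names, and using `None` to mean "not set" is an assumption made for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WarmstartSettings:
    """Hypothetical container mirroring the parameter names in this section."""

    warmstart_use_mean: bool = False
    warmstart_use_knn: bool = True
    warmstart_num_random_starts: int = 5
    # None means "inherit the value of the matching fit-time parameter".
    transform_warmstart_use_mean: Optional[bool] = None
    transform_warmstart_use_knn: Optional[bool] = None
    transform_warmstart_num_random_starts: Optional[int] = None

    def resolved_transform_settings(self):
        """Return the transform-step values after applying the fallbacks."""

        def pick(override, fallback):
            return fallback if override is None else override

        return {
            "use_mean": pick(self.transform_warmstart_use_mean,
                             self.warmstart_use_mean),
            "use_knn": pick(self.transform_warmstart_use_knn,
                            self.warmstart_use_knn),
            "num_random_starts": pick(
                self.transform_warmstart_num_random_starts,
                self.warmstart_num_random_starts),
        }
```

Changing only `warmstart_num_random_starts` therefore changes the transform step as well, unless `transform_warmstart_num_random_starts` is set explicitly.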

## Optimal k-NN Imputation

The `OptKNNImputationLearner` is used for optimal k-NN-based imputation. In addition to the shared parameters, it uses the following parameters:

`knn_k`

`knn_k` is an `Integer` specifying the number of neighbors to use in the k-NN calculations. The default value is `10`, but tuning this value can lead to better imputations.

## Optimal SVM Imputation

The `OptSVMImputationLearner` is used for optimal SVM-based imputation. In addition to the shared parameters, it uses the following parameters:

`svm_gamma`

`svm_gamma` is a `Real` or `Symbol` controlling the amount of regularization in the SVM fitting. The default value is `:auto`, but tuning this value can lead to better imputations. Refer to the documentation on the `gamma` parameter for OptimalFeatureSelection for more information.

`svm_epsilon`

`svm_epsilon` is a `Real` specifying $\epsilon$ in the hinge loss used when fitting SVR models. The default value is `0.1`, but tuning this value can lead to better imputations. Refer to the documentation on the hinge loss parameter for more information.
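As a rough guide to what $\epsilon$ controls, the Python sketch below evaluates the standard $\epsilon$-insensitive loss commonly used in support vector regression, where errors smaller than $\epsilon$ cost nothing. It assumes this standard form is what the parameter refers to; the exact loss used by OptImpute is not stated on this page, and `epsilon_insensitive_loss` is a hypothetical helper name.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Standard SVR loss: zero inside the epsilon tube, linear outside it.

    Assumption: this textbook form matches the loss referred to above.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    return max(0.0, abs(y_true - y_pred) - epsilon)
```

Under this form, a larger `svm_epsilon` tolerates larger residuals before they are penalized, while a smaller value penalizes more of them.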

## Optimal Tree Imputation

The `OptTreeImputationLearner` is used for optimal tree-based imputation. There are no additional parameters beyond the shared parameters.

## Heuristic Imputation Methods

OptImpute also offers a number of heuristic imputation methods. Each of these learners supports the `cluster` and `cluster_max_size` shared parameters.

### Random Imputation

The `RandImputationLearner` imputes each missing value by selecting a random value from the corresponding column in the data. There are no additional parameters.
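The idea behind random imputation can be sketched in a few lines of Python. This is a conceptual illustration for a single column, not the `RandImputationLearner` implementation; `random_impute` is a hypothetical name, and missing values are represented as `None` by assumption.

```python
import random


def random_impute(column, rng=None):
    """Fill each None with a value drawn at random from the observed entries."""
    if rng is None:
        rng = random.Random()
    observed = [v for v in column if v is not None]
    if not observed:
        # Nothing to sample from, so the column is returned unchanged.
        return list(column)
    return [rng.choice(observed) if v is None else v for v in column]
```

Passing a seeded generator such as `random.Random(0)` makes the result reproducible.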

### Mean Imputation

The `MeanImputationLearner` imputes missing values by replacing them with the mean value in the column for numeric values, and the mode of the column for categoric values. There are no additional parameters.
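The mean/mode rule can be illustrated for a single column with the Python sketch below. It is not the `MeanImputationLearner` implementation; `mean_impute` and its `numeric` flag are hypothetical, and missing values are represented as `None` by assumption.

```python
from statistics import mean, mode


def mean_impute(column, numeric=True):
    """Fill None entries with the column mean (numeric) or mode (categoric)."""
    observed = [v for v in column if v is not None]
    if not observed:
        # No observed values to summarize, so the column is returned unchanged.
        return list(column)
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in column]
```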

### Single k-NN Imputation

The `SingleKNNImputationLearner` imputes each missing value by identifying the k-nearest neighbors (based on non-missing values) and taking the mean (for numeric) or mode (for categoric) of the neighbors' values in the corresponding column. The `knn_k` parameter is supported to allow specifying the number of neighbors to consider.
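A minimal Python sketch of this procedure for numeric data is shown below. It makes several assumptions that this page does not specify: distances are the mean squared difference over the columns observed in both rows, only rows with an observed value in the target column can serve as neighbors, and categoric columns (which would use the mode) are left out. `knn_impute` and `_distance` are hypothetical names.

```python
import math


def _distance(a, b, skip):
    """Mean squared difference over columns observed in both rows."""
    shared = [
        (x, y)
        for c, (x, y) in enumerate(zip(a, b))
        if c != skip and x is not None and y is not None
    ]
    if not shared:
        return math.inf
    return sum((x - y) ** 2 for x, y in shared) / len(shared)


def knn_impute(rows, knn_k=10):
    """Fill each missing numeric entry with the mean of its nearest neighbors."""
    result = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors must have an observed value in column j.
            donors = [
                other
                for idx, other in enumerate(rows)
                if idx != i and other[j] is not None
            ]
            donors.sort(key=lambda other: _distance(row, other, j))
            nearest = donors[:knn_k]
            if nearest:
                result[i][j] = sum(d[j] for d in nearest) / len(nearest)
    return result
```

A small `knn_k` follows local structure closely, while a large `knn_k` moves the result towards the column mean, which is why this value is worth tuning.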

## Generic Learner

It is possible to construct a learner for a given imputation method using `ImputationLearner` with the `method` keyword argument. The methods that can be specified are:

| `method` | Learner |
|---|---|
| `:opt_knn` | `OptKNNImputationLearner` |
| `:opt_svm` | `OptSVMImputationLearner` |
| `:opt_tree` | `OptTreeImputationLearner` |
| `:rand` | `RandImputationLearner` |
| `:mean` | `MeanImputationLearner` |
| `:knn` | `SingleKNNImputationLearner` |

This is especially useful when using a `GridSearch` to choose the imputation method, as you can pass a vector of methods to choose from during validation.
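The idea of choosing among a vector of methods during validation can be sketched in Python as follows. This is a conceptual stand-in, not the `GridSearch` API: it hides a few known values, imputes them with each candidate, and keeps the method with the lowest mean absolute error. The scoring rule, the string method names, and the functions `mean_fill`, `rand_fill`, and `choose_method` are all assumptions made for the illustration.

```python
import random
from statistics import mean


def mean_fill(column, rng):
    """Toy stand-in for the mean method (rng is unused)."""
    fill = mean(v for v in column if v is not None)
    return [fill if v is None else v for v in column]


def rand_fill(column, rng):
    """Toy stand-in for the rand method."""
    observed = [v for v in column if v is not None]
    return [rng.choice(observed) if v is None else v for v in column]


# Keys reuse method names from the table above; the functions are toy stand-ins.
METHODS = {"mean": mean_fill, "rand": rand_fill}


def choose_method(column, methods, holdout, seed=0):
    """Hide the `holdout` positions, impute, and keep the lowest-error method."""
    masked = [None if i in holdout else v for i, v in enumerate(column)]
    scores = {}
    for name in methods:
        imputed = METHODS[name](masked, random.Random(seed))
        scores[name] = mean(abs(imputed[i] - column[i]) for i in holdout)
    best = min(scores, key=scores.get)
    return best, scores
```

Calling `choose_method(column, ["mean", "rand"], holdout={1, 4})` returns the winning method name together with the score of every candidate.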