Quick Start Guide: Optimal Imputation

This is an R version of the corresponding OptImpute quick start guide.

On this page, we show how to use the imputation methods of OptImpute on the echocardiogram dataset:

df <- read.table(
    "echocardiogram.data",
    sep = ",",
    na.strings = "?",
    col.names = c("survival", "alive", "age_at_heart_attack", "pe", "fs",
                  "epss", "lvdd", "wm_score", "wm_index", "mult", "name",
                  "group", "alive_at_one")
)
  survival alive age_at_heart_attack pe    fs   epss  lvdd wm_score wm_index
1       11     0                  71  0 0.260  9.000 4.600       14     1.00
2       19     0                  72  0 0.380  6.000 4.100       14     1.70
3       16     0                  55  0 0.260  4.000 3.420       14     1.00
4       57     0                  60  0 0.253 12.062 4.603       16     1.45
   mult name group alive_at_one
1 1.000 name     1            0
2 0.588 name     1            0
3 1.000 name     1            0
4 0.788 name     1            0
 [ reached 'max' / getOption("max.print") -- omitted 128 rows ]

There are a number of missing values in the dataset:

colMeans(is.na(df))
           survival               alive age_at_heart_attack                  pe
        0.015151515         0.007575758         0.037878788         0.007575758
                 fs                epss                lvdd            wm_score
        0.060606061         0.113636364         0.083333333         0.030303030
           wm_index                mult                name               group
        0.007575758         0.030303030         0.000000000         0.166666667
       alive_at_one
        0.439393939

Simple Imputation

We can use impute to fill the missing values in a data frame:

iai::impute(df)
  survival alive age_at_heart_attack pe    fs   epss  lvdd wm_score wm_index
1       11     0                  71  0 0.260  9.000 4.600       14     1.00
2       19     0                  72  0 0.380  6.000 4.100       14     1.70
3       16     0                  55  0 0.260  4.000 3.420       14     1.00
4       57     0                  60  0 0.253 12.062 4.603       16     1.45
   mult name group alive_at_one
1 1.000 name     1            0
2 0.588 name     1            0
3 1.000 name     1            0
4 0.788 name     1            0
 [ reached 'max' / getOption("max.print") -- omitted 128 rows ]

We can control which imputation method is used by passing the method name:

iai::impute(df, "opt_svm")

If you don't know which imputation method or parameter values are best, you can define a grid of parameters to search over:

iai::impute(df, list(method = c("opt_knn", "opt_svm")))

You can also use impute_cv to conduct the search with cross-validation:

iai::impute_cv(df, list(method = c("opt_knn", "opt_svm")))

Learner Interface

You can also use OptImpute through the same learner interface as other IAI packages. In particular, use this approach when you want to conduct a proper out-of-sample evaluation of performance.

We will split the data into training and testing:

split <- iai::split_data("imputation", df, seed = 1)
train_df <- split$train$X
test_df <- split$test$X

First, create a learner and set parameters as normal:

lnr <- iai::opt_knn_imputation_learner(random_seed = 1)
Julia Object of type OptKNNImputationLearner.
Unfitted OptKNNImputationLearner:
  random_seed: 1

Note that it is also possible to construct the learner with imputation_learner and the method keyword argument (which can be useful when specifying the method programmatically):

lnr <- iai::imputation_learner(method = "opt_knn", random_seed = 1)
Julia Object of type OptKNNImputationLearner.
Unfitted OptKNNImputationLearner:
  random_seed: 1

We can then train the imputation learner on the training dataset with fit:

iai::fit(lnr, train_df)
Julia Object of type OptKNNImputationLearner.
Fitted OptKNNImputationLearner

The fitted learner can then be used to fill missing values with transform:

iai::transform(lnr, test_df)
  survival alive age_at_heart_attack pe   fs epss lvdd wm_score wm_index  mult
1       26     0                  68  0 0.26    5 4.31     12.0     1.00 0.857
2       25     0                  54  0 0.14   13 4.49     15.5     1.19 0.930
3       10     1                  77  0 0.13   16 4.23     18.0     1.80 0.714
4       52     0                  73  0 0.33    6 4.00     14.0     1.00 1.000
  name group alive_at_one
1 name     1            0
2 name     1            0
3 name     1            1
4 name     1            0
 [ reached 'max' / getOption("max.print") -- omitted 36 rows ]

It is common to impute the missing values in the training set right after fitting the learner, so you can combine these two steps using fit_transform:

iai::fit_transform(lnr, train_df)
  survival alive age_at_heart_attack pe    fs   epss  lvdd wm_score wm_index
1       11     0                  71  0 0.260  9.000 4.600       14     1.00
2       19     0                  72  0 0.380  6.000 4.100       14     1.70
3       16     0                  55  0 0.260  4.000 3.420       14     1.00
4       57     0                  60  0 0.253 12.062 4.603       16     1.45
   mult name group alive_at_one
1 1.000 name     1            0
2 0.588 name     1            0
3 1.000 name     1            0
4 0.788 name     1            0
 [ reached 'max' / getOption("max.print") -- omitted 88 rows ]

To tune parameters, you can use the standard grid_search interface; refer to the documentation on parameter tuning for more information.
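
As a sketch of what such a search might look like (the workflow follows the standard IAI grid search pattern; the parameter name knn_k and the candidate values here are illustrative assumptions, so check the OptImpute parameter documentation for the exact names):

```r
# Hypothetical example: search over the number of neighbors for
# optimal KNN imputation. The parameter name `knn_k` and its values
# are assumptions for illustration only.
grid <- iai::grid_search(
  iai::imputation_learner(method = "opt_knn", random_seed = 1),
  knn_k = c(5, 10, 15)
)

# Fit the grid on the training data and retrieve the best learner found
iai::fit(grid, train_df)
best_lnr <- iai::get_learner(grid)

# The best learner can then be used to impute the test set as before
iai::transform(best_lnr, test_df)
```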