Quick Start Guide: Optimal Imputation

This is an R version of the corresponding OptImpute quick start guide.

On this page we show examples of how to use the imputation methods of OptImpute on the echocardiogram dataset:

df <- read.table(
    "echocardiogram.data",
    sep = ",",
    na.strings = "?",
    col.names = c("survival", "alive", "age_at_heart_attack", "pe", "fs",
                  "epss", "lvdd", "wm_score", "wm_index", "mult", "name",
                  "group", "alive_at_one"),
    stringsAsFactors = TRUE
)
  survival alive age_at_heart_attack pe    fs   epss  lvdd wm_score wm_index
1       11     0                  71  0 0.260  9.000 4.600       14     1.00
2       19     0                  72  0 0.380  6.000 4.100       14     1.70
3       16     0                  55  0 0.260  4.000 3.420       14     1.00
4       57     0                  60  0 0.253 12.062 4.603       16     1.45
   mult name group alive_at_one
1 1.000 name     1            0
2 0.588 name     1            0
3 1.000 name     1            0
4 0.788 name     1            0
 [ reached 'max' / getOption("max.print") -- omitted 128 rows ]

There are a number of missing values in the dataset:

colMeans(is.na(df))
           survival               alive age_at_heart_attack                  pe
        0.015151515         0.007575758         0.037878788         0.007575758
                 fs                epss                lvdd            wm_score
        0.060606061         0.113636364         0.083333333         0.030303030
           wm_index                mult                name               group
        0.007575758         0.030303030         0.000000000         0.166666667
       alive_at_one
        0.439393939

Learner Interface

OptImpute uses the same learner interface as other IAI packages to allow you to properly conduct imputation on data that has been split into training and testing sets.

In this problem, we will use a survival framework and split the data into training and testing:

X <- df[, 3:13]
died <- df[, 2] == 0  # "alive" is 1 if the patient was still alive, so negate it for the death indicator
times <- df[, 1]
split <- iai::split_data("survival", X, died, times, seed = 1)
train_X <- split$train$X
test_X <- split$test$X

First, create a learner and set parameters as normal:

lnr <- iai::opt_knn_imputation_learner(random_seed = 1)
Julia Object of type OptKNNImputationLearner.
Unfitted OptKNNImputationLearner:
  random_seed: 1

Note that it is also possible to construct the learners with imputation_learner and the method keyword argument (which can be useful when specifying the method programmatically):

lnr <- iai::imputation_learner(method = "opt_knn", random_seed = 1)
Julia Object of type OptKNNImputationLearner.
Unfitted OptKNNImputationLearner:
  random_seed: 1

We can then train the imputation learner on the training dataset with fit:

iai::fit(lnr, train_X)
Julia Object of type OptKNNImputationLearner.
Fitted OptKNNImputationLearner

The fitted learner can then be used to fill missing values with transform:

test_X_imputed <- iai::transform(lnr, test_X)
  age_at_heart_attack pe   fs epss lvdd wm_score wm_index  mult name group
1                  72  0 0.38    6 4.10    14.00    1.700 0.588 name     1
2                  57  0 0.16   22 5.75    18.00    2.250 0.571 name     1
3                  73  0 0.33    6 4.00    14.00    1.000 1.000 name     1
4                  85  1 0.18   19 5.46    13.83    1.380 0.710 name     1
5                  54  0 0.30    7 3.85    10.00    1.667 0.430 name     2
  alive_at_one
1  0.000000000
2  0.000000000
3  0.000000000
4  1.000000000
5  0.001848291
 [ reached 'max' / getOption("max.print") -- omitted 35 rows ]

It is common to impute missing values in the training set immediately after fitting the learner, so you can combine these two steps using fit_transform:

train_X_imputed <- iai::fit_transform(lnr, train_X)
  age_at_heart_attack pe    fs   epss  lvdd wm_score wm_index  mult name group
1                  71  0 0.260  9.000 4.600     14.0    1.000 1.000 name     1
2                  55  0 0.260  4.000 3.420     14.0    1.000 1.000 name     1
3                  60  0 0.253 12.062 4.603     16.0    1.450 0.788 name     1
4                  68  0 0.260  5.000 4.310     12.0    1.000 0.857 name     1
5                  62  0 0.230 31.000 5.430     22.5    1.875 0.857 name     1
  alive_at_one
1            0
2            0
3            0
4            0
5            0
 [ reached 'max' / getOption("max.print") -- omitted 87 rows ]

To tune parameters, you can use the standard grid_search interface; refer to the documentation on parameter tuning for more information.
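
As a rough illustration, the sketch below tunes the number of nearest neighbors used by the optimal kNN method via grid search. It assumes that knn_k is exposed as a tunable parameter of the kNN imputation learner, and the candidate values shown are arbitrary choices for this example:

# Sketch: tune the number of nearest neighbors for optimal kNN imputation
# (assumes knn_k is a tunable parameter of the kNN imputation learner)
grid <- iai::grid_search(
    iai::imputation_learner(method = "opt_knn", random_seed = 1),
    knn_k = c(5, 10, 15)
)
iai::fit(grid, train_X)
iai::get_best_params(grid)

# Use the best learner found by the grid search to impute the test set
test_X_imputed <- iai::transform(iai::get_learner(grid), test_X)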