Quick Start Guide: Optimal Imputation

This is an R version of the corresponding OptImpute quick start guide.

On this page we show examples of how to use the imputation methods of OptImpute on the echocardiogram dataset:

df <- read.table(
    "echocardiogram.data",
    sep = ",",
    na.strings = "?",
    col.names = c("survival", "alive", "age_at_heart_attack", "pe", "fs",
                  "epss", "lvdd", "wm_score", "wm_index", "mult", "name",
                  "group", "alive_at_one"),
    stringsAsFactors = TRUE
)
  survival alive age_at_heart_attack pe    fs   epss  lvdd wm_score wm_index
1       11     0                  71  0 0.260  9.000 4.600       14     1.00
2       19     0                  72  0 0.380  6.000 4.100       14     1.70
3       16     0                  55  0 0.260  4.000 3.420       14     1.00
4       57     0                  60  0 0.253 12.062 4.603       16     1.45
   mult name group alive_at_one
1 1.000 name     1            0
2 0.588 name     1            0
3 1.000 name     1            0
4 0.788 name     1            0
 [ reached 'max' / getOption("max.print") -- omitted 128 rows ]

There are a number of missing values in the dataset:

colMeans(is.na(df))
           survival               alive age_at_heart_attack                  pe
        0.015151515         0.007575758         0.037878788         0.007575758
                 fs                epss                lvdd            wm_score
        0.060606061         0.113636364         0.083333333         0.030303030
           wm_index                mult                name               group
        0.007575758         0.030303030         0.000000000         0.166666667
       alive_at_one
        0.439393939

Simple Imputation

We can use impute to fill the missing values in a data frame:

df_imputed <- iai::impute(df, random_seed = 1)
  survival alive age_at_heart_attack pe    fs   epss  lvdd wm_score wm_index
1       11     0                  71  0 0.260  9.000 4.600       14     1.00
2       19     0                  72  0 0.380  6.000 4.100       14     1.70
3       16     0                  55  0 0.260  4.000 3.420       14     1.00
4       57     0                  60  0 0.253 12.062 4.603       16     1.45
   mult name group alive_at_one
1 1.000 name     1            0
2 0.588 name     1            0
3 1.000 name     1            0
4 0.788 name     1            0
 [ reached 'max' / getOption("max.print") -- omitted 128 rows ]

We can control which imputation method is used by passing the method name:

df_imputed <- iai::impute(df, "opt_tree", random_seed = 1)

If you don't know which imputation method or parameter values are best, you can define a grid of parameters to search over:

df_imputed <- iai::impute(df, list(method = c("opt_knn", "opt_tree")),
                          random_seed = 1)
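
The parameter grid can also include method-specific parameters to search over alongside the method itself. Below is a minimal sketch under that assumption; the knn_k parameter (the number of neighbors used by opt_knn) and its candidate values are assumed for illustration and are not shown elsewhere in this guide:

# Sketch: search over the number of neighbors for opt_knn
# (knn_k as a parameter name is an assumption for illustration)
df_imputed <- iai::impute(df, list(method = "opt_knn", knn_k = c(5, 10, 15)),
                          random_seed = 1)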

You can also use impute_cv to conduct the search with cross-validation:

df_imputed <- iai::impute_cv(df, list(method = c("opt_knn", "opt_tree")),
                             random_seed = 123)

Learner Interface

You can also use OptImpute through the same learner interface as other IAI packages. In particular, you should use this approach when you want to properly conduct an out-of-sample evaluation of performance.

In this problem, we will use a survival framework and split the data into training and testing:

X <- df[, 3:13]
died <- df[, 2]
times <- df[, 1]
split <- iai::split_data("survival", X, died, times, seed = 1)
train_X <- split$train$X
test_X <- split$test$X

First, create a learner and set parameters as normal:

lnr <- iai::opt_knn_imputation_learner(random_seed = 1)
Julia Object of type OptKNNImputationLearner.
Unfitted OptKNNImputationLearner:
  random_seed: 1

Note that it is also possible to construct the learners with imputation_learner and the method keyword argument (which can be useful when specifying the method programmatically):

lnr <- iai::imputation_learner(method = "opt_knn", random_seed = 1)
Julia Object of type OptKNNImputationLearner.
Unfitted OptKNNImputationLearner:
  random_seed: 1

We can then train the imputation learner on the training dataset with fit:

iai::fit(lnr, train_X)
Julia Object of type OptKNNImputationLearner.
Fitted OptKNNImputationLearner

The fitted learner can then be used to fill missing values with transform:

test_X_imputed <- iai::transform(lnr, test_X)
  age_at_heart_attack pe   fs epss lvdd wm_score wm_index  mult name group
1                  72  0 0.38  6.0 4.10    14.00     1.70 0.588 name     1
2                  57  0 0.16 22.0 5.75    18.00     2.25 0.571 name     1
3                  73  0 0.33  6.0 4.00    14.00     1.00 1.000 name     1
4                  85  1 0.18 19.0 5.46    13.83     1.38 0.710 name     1
5                  64  0 0.19  5.9 3.48    10.00     1.11 0.640 name     2
  alive_at_one
1   0.00000000
2   0.00000000
3   0.00000000
4   1.00000000
5   0.01118692
 [ reached 'max' / getOption("max.print") -- omitted 35 rows ]

You will commonly want to impute on the training set right after fitting the learner, so you can combine these two steps using fit_transform:

train_X_imputed <- iai::fit_transform(lnr, train_X)
  age_at_heart_attack pe    fs   epss  lvdd wm_score wm_index  mult name group
1                  71  0 0.260  9.000 4.600     14.0    1.000 1.000 name     1
2                  55  0 0.260  4.000 3.420     14.0    1.000 1.000 name     1
3                  60  0 0.253 12.062 4.603     16.0    1.450 0.788 name     1
4                  68  0 0.260  5.000 4.310     12.0    1.000 0.857 name     1
5                  62  0 0.230 31.000 5.430     22.5    1.875 0.857 name     1
  alive_at_one
1            0
2            0
3            0
4            0
5            0
 [ reached 'max' / getOption("max.print") -- omitted 87 rows ]

To tune parameters, you can use the standard grid_search interface. Refer to the documentation on parameter tuning for more information.
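
As a minimal sketch of what this might look like (assuming knn_k is a tunable parameter of the opt_knn imputation learner, and that fitting a grid of imputation learners takes only the feature data frame, as fit does above), you could tune the number of neighbors and then impute the test set with the best learner found:

# Sketch: tune an imputation learner with grid_search
# (knn_k and its candidate values are assumptions for illustration)
grid <- iai::grid_search(
    iai::imputation_learner(method = "opt_knn", random_seed = 1),
    knn_k = c(5, 10, 15)
)
iai::fit(grid, train_X)
best_lnr <- iai::get_learner(grid)
test_X_imputed <- iai::transform(best_lnr, test_X)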