Quick Start Guide: Optimal Imputation
This is an R version of the corresponding OptImpute quick start guide.
On this page we show examples of how to use the imputation methods of OptImpute on the echocardiogram dataset:
df <- read.table(
  "echocardiogram.data",
  sep = ",",
  na.strings = "?",
  col.names = c("survival", "alive", "age_at_heart_attack", "pe", "fs",
                "epss", "lvdd", "wm_score", "wm_index", "mult", "name",
                "group", "alive_at_one"),
  stringsAsFactors = TRUE
)
survival alive age_at_heart_attack pe fs epss lvdd wm_score wm_index
1 11 0 71 0 0.260 9.000 4.600 14 1.00
2 19 0 72 0 0.380 6.000 4.100 14 1.70
3 16 0 55 0 0.260 4.000 3.420 14 1.00
4 57 0 60 0 0.253 12.062 4.603 16 1.45
mult name group alive_at_one
1 1.000 name 1 0
2 0.588 name 1 0
3 1.000 name 1 0
4 0.788 name 1 0
[ reached 'max' / getOption("max.print") -- omitted 128 rows ]
There are a number of missing values in the dataset:
colMeans(is.na(df))
survival alive age_at_heart_attack pe
0.015151515 0.007575758 0.037878788 0.007575758
fs epss lvdd wm_score
0.060606061 0.113636364 0.083333333 0.030303030
wm_index mult name group
0.007575758 0.030303030 0.000000000 0.166666667
alive_at_one
0.439393939
Learner Interface
OptImpute uses the same learner interface as the other IAI packages, allowing you to correctly conduct imputation on data that has been split into training and testing sets.
In this problem, we will use a survival framework and split the data into training and testing sets:
X <- df[, 3:13]    # features
died <- df[, 2]    # event indicator
times <- df[, 1]   # survival times
split <- iai::split_data("survival", X, died, times, seed = 1)
train_X <- split$train$X
test_X <- split$test$X
First, create a learner and set parameters as normal:
lnr <- iai::opt_knn_imputation_learner(random_seed = 1)
Julia Object of type OptKNNImputationLearner.
Unfitted OptKNNImputationLearner:
random_seed: 1
Note that it is also possible to construct the learners with imputation_learner and the method keyword argument (which can be useful when specifying the method programmatically):
lnr <- iai::imputation_learner(method = "opt_knn", random_seed = 1)
Julia Object of type OptKNNImputationLearner.
Unfitted OptKNNImputationLearner:
random_seed: 1
We can then train the imputation learner on the training dataset with fit:
iai::fit(lnr, train_X)
Julia Object of type OptKNNImputationLearner.
Fitted OptKNNImputationLearner
The fitted learner can then be used to fill missing values with transform:
test_X_imputed <- iai::transform(lnr, test_X)
age_at_heart_attack pe fs epss lvdd wm_score wm_index mult name group
1 72 0 0.38 6 4.10 14.00 1.700 0.588 name 1
2 57 0 0.16 22 5.75 18.00 2.250 0.571 name 1
3 73 0 0.33 6 4.00 14.00 1.000 1.000 name 1
4 85 1 0.18 19 5.46 13.83 1.380 0.710 name 1
5 54 0 0.30 7 3.85 10.00 1.667 0.430 name 2
alive_at_one
1 0.000000000
2 0.000000000
3 0.000000000
4 1.000000000
5 0.001848291
[ reached 'max' / getOption("max.print") -- omitted 35 rows ]
We commonly want to impute on the training set right after fitting the learner, so these two steps can be combined using fit_transform:
train_X_imputed <- iai::fit_transform(lnr, train_X)
age_at_heart_attack pe fs epss lvdd wm_score wm_index mult name group
1 71 0 0.260 9.000 4.600 14.0 1.000 1.000 name 1
2 55 0 0.260 4.000 3.420 14.0 1.000 1.000 name 1
3 60 0 0.253 12.062 4.603 16.0 1.450 0.788 name 1
4 68 0 0.260 5.000 4.310 12.0 1.000 0.857 name 1
5 62 0 0.230 31.000 5.430 22.5 1.875 0.857 name 1
alive_at_one
1 0
2 0
3 0
4 0
5 0
[ reached 'max' / getOption("max.print") -- omitted 87 rows ]
To tune parameters, you can use the standard grid_search interface; refer to the documentation on parameter tuning for more information.
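As a rough sketch of what this might look like for the opt_knn method used above (this assumes knn_k is the name of the parameter controlling the number of neighbors, and that iai::grid_search and iai::get_best_params behave here as they do for the other IAI learners; consult the parameter tuning documentation for the definitive interface):

```r
# Hypothetical sketch: search over candidate values of knn_k
# (parameter name and grid interface assumed, not confirmed above)
grid <- iai::grid_search(
  iai::imputation_learner(method = "opt_knn", random_seed = 1),
  knn_k = c(5, 10, 15)
)

# Fit the grid on the training features (imputation is unsupervised,
# so only train_X from the earlier split is passed)
iai::fit(grid, train_X)

# Inspect the best parameter combination found by the search
iai::get_best_params(grid)
```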