Quick Start Guide

This page shows how to use the imputation methods of OptImpute on the echocardiogram dataset:

using CSV, DataFrames
df = CSV.read("echocardiogram.data", DataFrame,
    missingstring="?",
    pool=true,
    header=[:survival, :alive, :age_at_heart_attack, :pe, :fs, :epss, :lvdd,
            :wm_score, :wm_index, :mult, :name, :group, :alive_at_one],
)

132×13 DataFrame
 Row │ survival  alive   age_at_heart_attack  pe      fs        epss      lvdd ⋯
     │ Float64?  Int64?  Float64?             Int64?  Float64?  Float64?  Floa ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     11.0       0                 71.0       0     0.26      9.0       4 ⋯
   2 │     19.0       0                 72.0       0     0.38      6.0       4
   3 │     16.0       0                 55.0       0     0.26      4.0       3
   4 │     57.0       0                 60.0       0     0.253    12.062     4
   5 │     19.0       1                 57.0       0     0.16     22.0       5 ⋯
   6 │     26.0       0                 68.0       0     0.26      5.0       4
   7 │     13.0       0                 62.0       0     0.23     31.0       5
   8 │     50.0       0                 60.0       0     0.33      8.0       5
  ⋮  │    ⋮        ⋮              ⋮             ⋮        ⋮         ⋮         ⋮ ⋱
 126 │     17.0       0            missing         0     0.09      6.8       4 ⋯
 127 │     21.0       0                 61.0       0     0.14     25.5       5
 128 │      7.5       1                 64.0       0     0.24     12.9       4
 129 │     41.0       0                 64.0       0     0.28      5.4       5
 130 │     36.0       0                 69.0       0     0.2       7.0       5 ⋯
 131 │     22.0       0                 57.0       0     0.14     16.1       4
 132 │     20.0       0                 62.0       0     0.15      0.0       4
                                                  7 columns and 117 rows omitted

Most columns in the dataset contain missing values. We can compute the fraction of missing entries in each column:

using Statistics
DataFrame(col=propertynames(df),
          missing_fraction=[mean(ismissing.(col)) for col in eachcol(df)])

13×2 DataFrame
 Row │ col                  missing_fraction
     │ Symbol               Float64
─────┼───────────────────────────────────────
   1 │ survival                   0.0151515
   2 │ alive                      0.00757576
   3 │ age_at_heart_attack        0.0378788
   4 │ pe                         0.00757576
   5 │ fs                         0.0606061
   6 │ epss                       0.113636
   7 │ lvdd                       0.0833333
   8 │ wm_score                   0.030303
   9 │ wm_index                   0.00757576
  10 │ mult                       0.030303
  11 │ name                       0.0
  12 │ group                      0.166667
  13 │ alive_at_one               0.439394
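
As a self-contained illustration of the calculation above, the following sketch computes the same per-column fraction on a small made-up table and also counts the rows that have no missing values at all, which is what would remain if incomplete rows were dropped rather than imputed. The `cols` data and its values are hypothetical and are not taken from the echocardiogram file.

```julia
using Statistics

# Hypothetical toy columns standing in for a table
cols = (age  = [71.0, missing, 55.0, 60.0],
        epss = [9.0, 6.0, missing, missing])

# Fraction of missing entries in each column
missing_fraction = Dict(name => mean(ismissing.(col)) for (name, col) in pairs(cols))

# Number of rows where every column has a value
n = length(first(cols))
complete_rows = count(i -> all(col -> !ismissing(col[i]), cols), 1:n)
```

With DataFrames, `completecases(df)` returns the corresponding row-wise indicator directly.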

Learner Interface

OptImpute uses the same learner interface as other IAI packages to allow you to properly conduct imputation on data that has been split into training and testing sets.

In this problem, we will use a survival framework, so we split the features, death indicators, and survival times into training and testing sets:

X = df[!, 3:end]
died = [ismissing(x) || x == 0 for x in df.alive]
times = df.survival
(train_X, train_died, train_times), (test_X, test_died, test_times) =
    IAI.split_data(:survival, X, died, times, seed=1)
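
The `ismissing(x) ||` guard in the definition of `died` matters because comparing `missing` with a number yields `missing`, not a `Bool`. The sketch below shows this on a small hypothetical vector standing in for `df.alive`, along with an equivalent broadcast form using `coalesce`. Both forms treat an unknown status as a death, which is what the code above does.

```julia
# Hypothetical stand-in for df.alive: 0 = died, 1 = alive, missing = unknown
alive = [0, 1, missing, 0]

# `missing == 0` evaluates to `missing`, so it cannot be used as a Bool directly.
# Checking ismissing first short-circuits, so every element is a plain Bool.
died = [ismissing(x) || x == 0 for x in alive]

# Equivalent broadcast form: replace missing comparison results with true
died_alt = coalesce.(alive .== 0, true)
```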

First, create a learner and set parameters as normal:

lnr = IAI.OptKNNImputationLearner(random_seed=1)

Unfitted OptKNNImputationLearner:
  random_seed: 1

Note that it is also possible to construct the learners with ImputationLearner and the method keyword argument (which can be useful when specifying the method programmatically):

lnr = IAI.ImputationLearner(method=:opt_knn, random_seed=1)

Unfitted OptKNNImputationLearner:
  random_seed: 1

We can then train the imputation learner on the training dataset with fit!:

IAI.fit!(lnr, train_X)

Fitted OptKNNImputationLearner

The fitted learner can then be used to fill missing values with transform:

IAI.transform(lnr, test_X)

40×11 DataFrame
 Row │ age_at_heart_attack  pe        fs        epss      lvdd      wm_score   ⋯
     │ Float64?             Float64?  Float64?  Float64?  Float64?  Float64?   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │                72.0       0.0    0.38     6.0       4.1         14.0    ⋯
   2 │                57.0       0.0    0.16    22.0       5.75        18.0
   3 │                73.0       0.0    0.33     6.0       4.0         14.0
   4 │                85.0       1.0    0.18    19.0       5.46        13.83
   5 │                54.0       0.0    0.3      7.0       3.85        10.0    ⋯
   6 │                55.0       0.0    0.2315   7.0       4.611        2.0
   7 │                65.0       0.0    0.15     9.29387   5.05        10.0
   8 │                60.0       0.0    0.222   12.0       4.43909      6.0
  ⋮  │          ⋮              ⋮         ⋮         ⋮         ⋮         ⋮       ⋱
  34 │                64.0       0.0    0.15     6.6       4.17        14.0    ⋯
  35 │                61.0       0.0    0.18     0.0       4.48        11.0
  36 │                48.0       0.0    0.15    12.0       3.66        10.0
  37 │                65.5       0.0    0.09     6.8       4.96        13.0
  38 │                61.0       0.0    0.14    25.5       5.16        14.0    ⋯
  39 │                64.0       0.0    0.24    12.9       4.72        12.0
  40 │                57.0       0.0    0.14    16.1       4.36        15.0
                                                   5 columns and 25 rows omitted

It is common to impute the training set right after fitting the learner, so these two steps can be combined using fit_transform!:

IAI.fit_transform!(lnr, train_X)

92×11 DataFrame
 Row │ age_at_heart_attack  pe        fs        epss      lvdd      wm_score   ⋯
     │ Float64?             Float64?  Float64?  Float64?  Float64?  Float64?   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │             71.0          0.0     0.26      9.0       4.6        14.0   ⋯
   2 │             55.0          0.0     0.26      4.0       3.42       14.0
   3 │             60.0          0.0     0.253    12.062     4.603      16.0
   4 │             68.0          0.0     0.26      5.0       4.31       12.0
   5 │             62.0          0.0     0.23     31.0       5.43       22.5   ⋯
   6 │             60.0          0.0     0.33      8.0       5.25       14.0
   7 │             46.0          0.0     0.34      0.0       5.09       16.0
   8 │             54.0          0.0     0.14     13.0       4.49       15.5
  ⋮  │          ⋮              ⋮         ⋮         ⋮         ⋮         ⋮       ⋱
  86 │             54.0          0.0     0.43      9.3       4.79       10.0   ⋯
  87 │             57.6429       0.0     0.23     19.1       5.49       12.0
  88 │             57.0          1.0     0.12      0.0       2.32       16.5
  89 │             61.0          1.0     0.19     13.2       5.04       19.0
  90 │             64.0          0.0     0.28      5.4       5.47       11.0   ⋯
  91 │             69.0          0.0     0.2       7.0       5.05       14.5
  92 │             62.0          0.0     0.15      0.0       4.51       15.5
                                                   5 columns and 77 rows omitted
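
The reason for separating fitting from transforming can be illustrated with a deliberately simple stand-in. The sketch below uses plain column-mean imputation, not the optimal KNN method used above, and every name in it (`ToyMeanImputer`, `toy_fit!`, `toy_transform`, `toy_fit_transform!`) is hypothetical rather than part of OptImpute. The point it shows is that the fill values are learned from the training data only and then reused unchanged on the test data.

```julia
using Statistics

# Hypothetical toy imputer that stores one mean per column (not an OptImpute type)
mutable struct ToyMeanImputer
    means::Vector{Float64}
    ToyMeanImputer() = new(Float64[])
end

# "fit": learn column means from the given data, skipping missing values
function toy_fit!(imp::ToyMeanImputer, X::AbstractMatrix)
    imp.means = [mean(skipmissing(col)) for col in eachcol(X)]
    return imp
end

# "transform": fill missing entries with the stored means
function toy_transform(imp::ToyMeanImputer, X::AbstractMatrix)
    return [ismissing(X[i, j]) ? imp.means[j] : Float64(X[i, j])
            for i in axes(X, 1), j in axes(X, 2)]
end

# "fit_transform": fit on the data, then transform that same data
toy_fit_transform!(imp, X) = toy_transform(toy_fit!(imp, X), X)

train = [1.0 10.0; 3.0 missing; missing 30.0]
test  = [missing 50.0; 100.0 missing]

imp = ToyMeanImputer()
train_filled = toy_fit_transform!(imp, train)  # learns means 2.0 and 20.0
test_filled  = toy_transform(imp, test)        # reuses the training means
```

Here the missing entry in the first test column is filled with 2.0, the training mean, even though the only observed test value in that column is 100.0. No information from the test set leaks into the imputation.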

To tune parameters, you can use the standard GridSearch interface. Refer to the documentation on parameter tuning for more information.