Quick Start Guide

This page shows how to use the imputation methods of OptImpute on the echocardiogram dataset:

using CSV, DataFrames
df = DataFrame(CSV.File("echocardiogram.data",
    missingstring="?",
    pool=true,
    header=[:survival, :alive, :age_at_heart_attack, :pe, :fs, :epss, :lvdd,
            :wm_score, :wm_index, :mult, :name, :group, :alive_at_one],
))
132×13 DataFrame. Omitted printing of 7 columns
│ Row │ survival │ alive  │ age_at_heart_attack │ pe     │ fs       │ epss     │
│     │ Float64? │ Int64? │ Float64?            │ Int64? │ Float64? │ Float64? │
├─────┼──────────┼────────┼─────────────────────┼────────┼──────────┼──────────┤
│ 1   │ 11.0     │ 0      │ 71.0                │ 0      │ 0.26     │ 9.0      │
│ 2   │ 19.0     │ 0      │ 72.0                │ 0      │ 0.38     │ 6.0      │
│ 3   │ 16.0     │ 0      │ 55.0                │ 0      │ 0.26     │ 4.0      │
│ 4   │ 57.0     │ 0      │ 60.0                │ 0      │ 0.253    │ 12.062   │
│ 5   │ 19.0     │ 1      │ 57.0                │ 0      │ 0.16     │ 22.0     │
│ 6   │ 26.0     │ 0      │ 68.0                │ 0      │ 0.26     │ 5.0      │
│ 7   │ 13.0     │ 0      │ 62.0                │ 0      │ 0.23     │ 31.0     │
⋮
│ 125 │ 36.0     │ 0      │ 48.0                │ 0      │ 0.15     │ 12.0     │
│ 126 │ 17.0     │ 0      │ missing             │ 0      │ 0.09     │ 6.8      │
│ 127 │ 21.0     │ 0      │ 61.0                │ 0      │ 0.14     │ 25.5     │
│ 128 │ 7.5      │ 1      │ 64.0                │ 0      │ 0.24     │ 12.9     │
│ 129 │ 41.0     │ 0      │ 64.0                │ 0      │ 0.28     │ 5.4      │
│ 130 │ 36.0     │ 0      │ 69.0                │ 0      │ 0.2      │ 7.0      │
│ 131 │ 22.0     │ 0      │ 57.0                │ 0      │ 0.14     │ 16.1     │
│ 132 │ 20.0     │ 0      │ 62.0                │ 0      │ 0.15     │ 0.0      │

There are a number of missing values in the dataset:

using Statistics
DataFrame(col=propertynames(df),
          missing_fraction=[mean(ismissing.(col)) for col in eachcol(df)])
13×2 DataFrame
│ Row │ col                 │ missing_fraction │
│     │ Symbol              │ Float64          │
├─────┼─────────────────────┼──────────────────┤
│ 1   │ survival            │ 0.0151515        │
│ 2   │ alive               │ 0.00757576       │
│ 3   │ age_at_heart_attack │ 0.0378788        │
│ 4   │ pe                  │ 0.00757576       │
│ 5   │ fs                  │ 0.0606061        │
│ 6   │ epss                │ 0.113636         │
│ 7   │ lvdd                │ 0.0833333        │
│ 8   │ wm_score            │ 0.030303         │
│ 9   │ wm_index            │ 0.00757576       │
│ 10  │ mult                │ 0.030303         │
│ 11  │ name                │ 0.0              │
│ 12  │ group               │ 0.166667         │
│ 13  │ alive_at_one        │ 0.439394         │
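The fraction above relies on a small idiom: `ismissing.(col)` gives a Boolean mask, and the mean of a Boolean mask is the share of `true` entries. A minimal sketch on a toy column (the values here are made up for illustration and are not taken from the dataset):

```julia
using Statistics

# Toy column with two of four entries missing (illustrative values only)
col = [0.26, missing, 0.38, missing]

mask = ismissing.(col)          # elementwise test: false, true, false, true
missing_fraction = mean(mask)   # share of entries that are missing
missing_count = count(mask)     # number of entries that are missing
```

Using `count` instead of `mean` is handy when you want the absolute number of values that will need to be imputed in each column.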

Simple Imputation

We can use impute to fill the missing values in a DataFrame:

IAI.impute(df, random_seed=1)
132×13 DataFrame. Omitted printing of 8 columns
│ Row │ survival │ alive    │ age_at_heart_attack │ pe       │ fs       │
│     │ Float64? │ Float64? │ Float64?            │ Float64? │ Float64? │
├─────┼──────────┼──────────┼─────────────────────┼──────────┼──────────┤
│ 1   │ 11.0     │ 0.0      │ 71.0                │ 0.0      │ 0.26     │
│ 2   │ 19.0     │ 0.0      │ 72.0                │ 0.0      │ 0.38     │
│ 3   │ 16.0     │ 0.0      │ 55.0                │ 0.0      │ 0.26     │
│ 4   │ 57.0     │ 0.0      │ 60.0                │ 0.0      │ 0.253    │
│ 5   │ 19.0     │ 1.0      │ 57.0                │ 0.0      │ 0.16     │
│ 6   │ 26.0     │ 0.0      │ 68.0                │ 0.0      │ 0.26     │
│ 7   │ 13.0     │ 0.0      │ 62.0                │ 0.0      │ 0.23     │
⋮
│ 125 │ 36.0     │ 0.0      │ 48.0                │ 0.0      │ 0.15     │
│ 126 │ 17.0     │ 0.0      │ 61.875              │ 0.0      │ 0.09     │
│ 127 │ 21.0     │ 0.0      │ 61.0                │ 0.0      │ 0.14     │
│ 128 │ 7.5      │ 1.0      │ 64.0                │ 0.0      │ 0.24     │
│ 129 │ 41.0     │ 0.0      │ 64.0                │ 0.0      │ 0.28     │
│ 130 │ 36.0     │ 0.0      │ 69.0                │ 0.0      │ 0.2      │
│ 131 │ 22.0     │ 0.0      │ 57.0                │ 0.0      │ 0.14     │
│ 132 │ 20.0     │ 0.0      │ 62.0                │ 0.0      │ 0.15     │

We can choose the imputation method by passing it as the second argument:

IAI.impute(df, :opt_tree, random_seed=1)
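Because the method is an ordinary argument, it may help to run several methods side by side and inspect how each one fills the same entry. The following is a rough sketch that assumes `df` has been loaded as at the top of this page; `method_names` and `imputed` are names introduced here for illustration, and only method symbols that appear elsewhere on this page are used:

```julia
# Assumes `df` is the echocardiogram DataFrame loaded above.
method_names = [:opt_knn, :opt_tree, :opt_svm]

# Impute once per method, keeping each filled DataFrame.
imputed = Dict(m => IAI.impute(df, m, random_seed=1) for m in method_names)

# Sanity check: no missing values should remain in any result.
for (m, filled) in imputed
    @assert !any(col -> any(ismissing, col), eachcol(filled))
end

# Row 126 has a missing age_at_heart_attack in the original data,
# so this shows the value each method chose for it.
[m => imputed[m][126, :age_at_heart_attack] for m in method_names]
```

Comparing filled values this way is only an informal check; it does not tell you which method is more accurate.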

If you don't know which imputation method or parameter values are best, you can define the grid of parameters to be searched over:

IAI.impute(df, Dict(:method => [:opt_knn, :opt_tree]), random_seed=1)

You can also use impute_cv to conduct the search with cross-validation:

IAI.impute_cv(df, Dict(:method => [:opt_knn, :opt_svm]), random_seed=1)

Learner Interface

OptImpute can also be used through the same learner interface as other IAI packages. In particular, use this approach when you want to conduct a proper out-of-sample evaluation of performance.

We will split the data into training and testing sets:

(train_df,), (test_df,) = IAI.split_data(:imputation, df, seed=1)

First, create a learner and set parameters as normal:

lnr = IAI.OptKNNImputationLearner(random_seed=1)
Unfitted OptKNNImputationLearner:
  random_seed: 1

Note that it is also possible to construct the learners with ImputationLearner and the method keyword argument (which can be useful when specifying the method programmatically):

lnr = IAI.ImputationLearner(method=:opt_knn, random_seed=1)
Unfitted OptKNNImputationLearner:
  random_seed: 1
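As an illustration of the programmatic case, the method can come from a string that is only known at run time. In this sketch `method_name` is a hypothetical setting (for example, read from a config file); everything else uses the constructor shown above:

```julia
# Hypothetical setting, e.g. read from a config file or command-line argument.
method_name = "opt_knn"

# Convert the string to a Symbol and pass it through the `method` keyword.
lnr = IAI.ImputationLearner(method=Symbol(method_name), random_seed=1)
```

This keeps the rest of the script unchanged when you switch between imputation methods.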

We can then train the imputation learner on the training dataset with fit!:

IAI.fit!(lnr, train_df)
Fitted OptKNNImputationLearner

The fitted learner can then be used to fill missing values with transform:

IAI.transform(lnr, test_df)
40×13 DataFrame. Omitted printing of 8 columns
│ Row │ survival │ alive    │ age_at_heart_attack │ pe       │ fs       │
│     │ Float64? │ Float64? │ Float64?            │ Float64? │ Float64? │
├─────┼──────────┼──────────┼─────────────────────┼──────────┼──────────┤
│ 1   │ 19.0     │ 0.0      │ 72.0                │ 0.0      │ 0.38     │
│ 2   │ 19.0     │ 1.0      │ 57.0                │ 0.0      │ 0.16     │
│ 3   │ 52.0     │ 0.0      │ 73.0                │ 0.0      │ 0.33     │
│ 4   │ 0.75     │ 1.0      │ 85.0                │ 1.0      │ 0.18     │
│ 5   │ 48.0     │ 0.0      │ 64.0                │ 0.0      │ 0.19     │
│ 6   │ 29.0     │ 0.0      │ 54.0                │ 0.0      │ 0.3      │
│ 7   │ 29.0     │ 0.0      │ 55.0                │ 0.0      │ 0.269    │
⋮
│ 33  │ 25.0     │ 0.0      │ 57.0                │ 0.0      │ 0.228    │
│ 34  │ 24.0     │ 0.0      │ 57.0                │ 0.0      │ 0.036    │
│ 35  │ 27.0     │ 0.0      │ 57.0                │ 0.0      │ 0.29     │
│ 36  │ 34.0     │ 0.0      │ 54.0                │ 0.0      │ 0.43     │
│ 37  │ 17.0     │ 0.0      │ 64.0                │ 0.0      │ 0.15     │
│ 38  │ 38.0     │ 0.0      │ 57.0                │ 1.0      │ 0.12     │
│ 39  │ 12.0     │ 0.0      │ 61.0                │ 1.0      │ 0.19     │
│ 40  │ 36.0     │ 0.0      │ 69.0                │ 0.0      │ 0.2      │
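One possible way to carry out the out-of-sample evaluation mentioned earlier is to hide some known test values, impute them, and compare against the truth. The sketch below is not part of the IAI API: `masked_mae` is a hypothetical helper, and the masking fraction and error metric (mean absolute error) are assumptions chosen for illustration. It takes the imputation step as a function so that any imputer can be plugged in.

```julia
using DataFrames, Random, Statistics

# Hypothetical helper (not part of IAI): hide a random share of the known
# entries in one numeric column, impute, and return the mean absolute error
# on the hidden entries. `impute_fn` maps a DataFrame with missing values to
# a filled DataFrame.
function masked_mae(impute_fn, df::DataFrame, col::Symbol;
                    frac=0.2, rng=MersenneTwister(1))
    known = findall(!ismissing, df[!, col])
    n_hide = max(1, round(Int, frac * length(known)))
    hidden = shuffle(rng, known)[1:n_hide]

    masked = copy(df)             # copies the columns, so `df` is not modified
    allowmissing!(masked, col)    # make sure the column can hold `missing`
    masked[hidden, col] .= missing

    filled = impute_fn(masked)
    return mean(abs.(filled[hidden, col] .- df[hidden, col]))
end
```

With the learner fitted on `train_df`, a call such as `masked_mae(d -> IAI.transform(lnr, d), test_df, :fs)` would then estimate the error on held-out data. Because the learner never saw `test_df` during fitting, the estimate is not biased by the training data.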

It is common to impute the training set right after fitting the learner, so you can combine these two steps using fit_transform!:

IAI.fit_transform!(lnr, train_df)
92×13 DataFrame. Omitted printing of 8 columns
│ Row │ survival │ alive    │ age_at_heart_attack │ pe       │ fs       │
│     │ Float64? │ Float64? │ Float64?            │ Float64? │ Float64? │
├─────┼──────────┼──────────┼─────────────────────┼──────────┼──────────┤
│ 1   │ 11.0     │ 0.0      │ 71.0                │ 0.0      │ 0.26     │
│ 2   │ 16.0     │ 0.0      │ 55.0                │ 0.0      │ 0.26     │
│ 3   │ 57.0     │ 0.0      │ 60.0                │ 0.0      │ 0.253    │
│ 4   │ 26.0     │ 0.0      │ 68.0                │ 0.0      │ 0.26     │
│ 5   │ 13.0     │ 0.0      │ 62.0                │ 0.0      │ 0.23     │
│ 6   │ 50.0     │ 0.0      │ 60.0                │ 0.0      │ 0.33     │
│ 7   │ 19.0     │ 0.0      │ 46.0                │ 0.0      │ 0.34     │
⋮
│ 85  │ 31.0     │ 0.0      │ 61.0                │ 0.0      │ 0.18     │
│ 86  │ 36.0     │ 0.0      │ 48.0                │ 0.0      │ 0.15     │
│ 87  │ 17.0     │ 0.0      │ 61.8235             │ 0.0      │ 0.09     │
│ 88  │ 21.0     │ 0.0      │ 61.0                │ 0.0      │ 0.14     │
│ 89  │ 7.5      │ 1.0      │ 64.0                │ 0.0      │ 0.24     │
│ 90  │ 41.0     │ 0.0      │ 64.0                │ 0.0      │ 0.28     │
│ 91  │ 22.0     │ 0.0      │ 57.0                │ 0.0      │ 0.14     │
│ 92  │ 20.0     │ 0.0      │ 62.0                │ 0.0      │ 0.15     │

To tune parameters, you can use the standard GridSearch interface. Refer to the documentation on parameter tuning for more information.
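A grid search over a learner parameter might look like the sketch below. Treat the details as assumptions to check against the parameter tuning documentation: `knn_k` is assumed to be the name of the neighbor-count parameter of the learner, and the candidate values are arbitrary.

```julia
# Assumption: `knn_k` is a tunable parameter of the opt_knn learner;
# the candidate values below are arbitrary examples.
grid = IAI.GridSearch(
    IAI.ImputationLearner(method=:opt_knn, random_seed=1),
    knn_k=[5, 10, 15],
)

# Fit the grid on the training data only.
IAI.fit!(grid, train_df)

# Inspect the selected parameters, then impute the test set
# with the best learner found by the search.
IAI.get_best_params(grid)
IAI.transform(IAI.get_learner(grid), test_df)
```

Fitting the grid on `train_df` alone keeps `test_df` available for an unbiased check of the tuned learner.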