Quick Start Guide

This page shows how to use the imputation methods of OptImpute on the echocardiogram dataset:

using CSV, DataFrames
df = DataFrame(CSV.File("echocardiogram.data",
    missingstring="?",
    pool=true,
    header=[:survival, :alive, :age_at_heart_attack, :pe, :fs, :epss, :lvdd,
            :wm_score, :wm_index, :mult, :name, :group, :alive_at_one],
))

132×13 DataFrame. Omitted printing of 7 columns
│ Row │ survival │ alive  │ age_at_heart_attack │ pe     │ fs       │ epss     │
│     │ Float64? │ Int64? │ Float64?            │ Int64? │ Float64? │ Float64? │
├─────┼──────────┼────────┼─────────────────────┼────────┼──────────┼──────────┤
│ 1   │ 11.0     │ 0      │ 71.0                │ 0      │ 0.26     │ 9.0      │
│ 2   │ 19.0     │ 0      │ 72.0                │ 0      │ 0.38     │ 6.0      │
│ 3   │ 16.0     │ 0      │ 55.0                │ 0      │ 0.26     │ 4.0      │
│ 4   │ 57.0     │ 0      │ 60.0                │ 0      │ 0.253    │ 12.062   │
│ 5   │ 19.0     │ 1      │ 57.0                │ 0      │ 0.16     │ 22.0     │
│ 6   │ 26.0     │ 0      │ 68.0                │ 0      │ 0.26     │ 5.0      │
│ 7   │ 13.0     │ 0      │ 62.0                │ 0      │ 0.23     │ 31.0     │
⋮
│ 125 │ 36.0     │ 0      │ 48.0                │ 0      │ 0.15     │ 12.0     │
│ 126 │ 17.0     │ 0      │ missing             │ 0      │ 0.09     │ 6.8      │
│ 127 │ 21.0     │ 0      │ 61.0                │ 0      │ 0.14     │ 25.5     │
│ 128 │ 7.5      │ 1      │ 64.0                │ 0      │ 0.24     │ 12.9     │
│ 129 │ 41.0     │ 0      │ 64.0                │ 0      │ 0.28     │ 5.4      │
│ 130 │ 36.0     │ 0      │ 69.0                │ 0      │ 0.2      │ 7.0      │
│ 131 │ 22.0     │ 0      │ 57.0                │ 0      │ 0.14     │ 16.1     │
│ 132 │ 20.0     │ 0      │ 62.0                │ 0      │ 0.15     │ 0.0      │

Most columns in the dataset contain missing values. We can compute the fraction of missing entries in each column:

using Statistics
DataFrame(col=propertynames(df),
          missing_fraction=[mean(ismissing.(col)) for col in eachcol(df)])

13×2 DataFrame
│ Row │ col                 │ missing_fraction │
│     │ Symbol              │ Float64          │
├─────┼─────────────────────┼──────────────────┤
│ 1   │ survival            │ 0.0151515        │
│ 2   │ alive               │ 0.00757576       │
│ 3   │ age_at_heart_attack │ 0.0378788        │
│ 4   │ pe                  │ 0.00757576       │
│ 5   │ fs                  │ 0.0606061        │
│ 6   │ epss                │ 0.113636         │
│ 7   │ lvdd                │ 0.0833333        │
│ 8   │ wm_score            │ 0.030303         │
│ 9   │ wm_index            │ 0.00757576       │
│ 10  │ mult                │ 0.030303         │
│ 11  │ name                │ 0.0              │
│ 12  │ group               │ 0.166667         │
│ 13  │ alive_at_one        │ 0.439394         │
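
Absolute counts are sometimes easier to read than fractions. The following is a minimal sketch that uses only DataFrames on a small made-up table (`toy` and its values are illustrative assumptions, not the echocardiogram data). It counts the missing entries per column and the number of rows with no missing entries:

```julia
using DataFrames

# Toy data standing in for the real table (values are made up).
toy = DataFrame(a = [1.0, missing, 3.0, 4.0],
                b = [missing, missing, 1, 0],
                c = ["x", "y", "z", "w"])

# Absolute number of missing entries in each column.
missing_counts = DataFrame(col = propertynames(toy),
                           n_missing = [count(ismissing, col) for col in eachcol(toy)])

# Number of rows that have no missing entries at all.
n_complete = count(completecases(toy))
```

The same two expressions should work unchanged on `df`, since they rely only on `eachcol` and `completecases` from DataFrames.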

Simple Imputation

We can use impute to fill the missing values in a DataFrame:

IAI.impute(df, random_seed=1)

132×13 DataFrame. Omitted printing of 8 columns
│ Row │ survival │ alive    │ age_at_heart_attack │ pe       │ fs       │
│     │ Float64? │ Float64? │ Float64?            │ Float64? │ Float64? │
├─────┼──────────┼──────────┼─────────────────────┼──────────┼──────────┤
│ 1   │ 11.0     │ 0.0      │ 71.0                │ 0.0      │ 0.26     │
│ 2   │ 19.0     │ 0.0      │ 72.0                │ 0.0      │ 0.38     │
│ 3   │ 16.0     │ 0.0      │ 55.0                │ 0.0      │ 0.26     │
│ 4   │ 57.0     │ 0.0      │ 60.0                │ 0.0      │ 0.253    │
│ 5   │ 19.0     │ 1.0      │ 57.0                │ 0.0      │ 0.16     │
│ 6   │ 26.0     │ 0.0      │ 68.0                │ 0.0      │ 0.26     │
│ 7   │ 13.0     │ 0.0      │ 62.0                │ 0.0      │ 0.23     │
⋮
│ 125 │ 36.0     │ 0.0      │ 48.0                │ 0.0      │ 0.15     │
│ 126 │ 17.0     │ 0.0      │ 61.875              │ 0.0      │ 0.09     │
│ 127 │ 21.0     │ 0.0      │ 61.0                │ 0.0      │ 0.14     │
│ 128 │ 7.5      │ 1.0      │ 64.0                │ 0.0      │ 0.24     │
│ 129 │ 41.0     │ 0.0      │ 64.0                │ 0.0      │ 0.28     │
│ 130 │ 36.0     │ 0.0      │ 69.0                │ 0.0      │ 0.2      │
│ 131 │ 22.0     │ 0.0      │ 57.0                │ 0.0      │ 0.14     │
│ 132 │ 20.0     │ 0.0      │ 62.0                │ 0.0      │ 0.15     │
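
The output above shows that the imputed columns keep the element type `Float64?`, so they still allow missing values even though none remain. The sketch below shows one way to confirm that a table is fully imputed and then narrow the column types. It uses a hand-built stand-in (`imputed` is an assumption for illustration, not the real result of `IAI.impute`):

```julia
using DataFrames

# Stand-in for an imputed table: no missing entries, but the columns
# are still typed Union{Missing, Float64}.
imputed = DataFrame(x = Union{Missing, Float64}[1.0, 2.0, 3.0],
                    y = Union{Missing, Float64}[0.0, 1.0, 0.0])

# True when no column contains a missing entry.
fully_imputed = !any(col -> any(ismissing, col), eachcol(imputed))

# Narrow the element types to Float64; this throws an error if any
# missing entries remain.
strict = disallowmissing(imputed)
```

Narrowing the types is optional, but it can make downstream code that does not accept `Missing` easier to use.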

We can choose the imputation method by passing it as the second argument:

IAI.impute(df, :opt_tree, random_seed=1)

If you don't know which imputation method or parameter values are best, you can define the grid of parameters to be searched over:

IAI.impute(df, Dict(:method => [:opt_knn, :opt_tree]), random_seed=1)

You can also use impute_cv to conduct the search with cross-validation:

IAI.impute_cv(df, Dict(:method => [:opt_knn, :opt_svm]), random_seed=1)

Learner Interface

OptImpute can also be used through the same learner interface as other IAI packages. In particular, you should use this approach when you want to properly conduct an out-of-sample evaluation of performance.

We will split the data into training and testing sets:

(train_df,), (test_df,) = IAI.split_data(:imputation, df, seed=1)
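
To make the idea of a seeded random split concrete, here is a rough sketch using only DataFrames and Random on a made-up table. It is not how `IAI.split_data` is implemented. The 70% training proportion is an assumption, chosen because it is consistent with the 92-row and 40-row tables shown later on this page, and `toy`, `perm`, and `n_train` are illustrative names:

```julia
using DataFrames, Random

# Toy table with an id column so the two parts can be compared.
toy = DataFrame(id = 1:10, x = rand(MersenneTwister(0), 10))

# Shuffle the row indices with a fixed seed so the split is reproducible.
rng = MersenneTwister(1)
perm = shuffle(rng, 1:nrow(toy))

# Assumed 70/30 split: the first 70% of shuffled rows are used for training.
n_train = round(Int, 0.7 * nrow(toy))
train_toy = toy[perm[1:n_train], :]
test_toy = toy[perm[n_train+1:end], :]
```

Every row lands in exactly one of the two tables, which is the property that matters for out-of-sample evaluation.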

First, create a learner and set parameters as normal:

lnr = IAI.OptKNNImputationLearner(random_seed=1)

Unfitted OptKNNImputationLearner:
  random_seed: 1

Note that it is also possible to construct the learners with ImputationLearner and the method keyword argument (which can be useful when specifying the method programmatically):

lnr = IAI.ImputationLearner(method=:opt_knn, random_seed=1)

Unfitted OptKNNImputationLearner:
  random_seed: 1

We can then train the imputation learner on the training dataset with fit!:

IAI.fit!(lnr, train_df)

Fitted OptKNNImputationLearner

The fitted learner can then be used to fill missing values with transform:

IAI.transform(lnr, test_df)

40×13 DataFrame. Omitted printing of 8 columns
│ Row │ survival │ alive    │ age_at_heart_attack │ pe       │ fs       │
│     │ Float64? │ Float64? │ Float64?            │ Float64? │ Float64? │
├─────┼──────────┼──────────┼─────────────────────┼──────────┼──────────┤
│ 1   │ 19.0     │ 0.0      │ 72.0                │ 0.0      │ 0.38     │
│ 2   │ 19.0     │ 1.0      │ 57.0                │ 0.0      │ 0.16     │
│ 3   │ 52.0     │ 0.0      │ 73.0                │ 0.0      │ 0.33     │
│ 4   │ 0.75     │ 1.0      │ 85.0                │ 1.0      │ 0.18     │
│ 5   │ 48.0     │ 0.0      │ 64.0                │ 0.0      │ 0.19     │
│ 6   │ 29.0     │ 0.0      │ 54.0                │ 0.0      │ 0.3      │
│ 7   │ 29.0     │ 0.0      │ 55.0                │ 0.0      │ 0.269    │
⋮
│ 33  │ 25.0     │ 0.0      │ 57.0                │ 0.0      │ 0.228    │
│ 34  │ 24.0     │ 0.0      │ 57.0                │ 0.0      │ 0.036    │
│ 35  │ 27.0     │ 0.0      │ 57.0                │ 0.0      │ 0.29     │
│ 36  │ 34.0     │ 0.0      │ 54.0                │ 0.0      │ 0.43     │
│ 37  │ 17.0     │ 0.0      │ 64.0                │ 0.0      │ 0.15     │
│ 38  │ 38.0     │ 0.0      │ 57.0                │ 1.0      │ 0.12     │
│ 39  │ 12.0     │ 0.0      │ 61.0                │ 1.0      │ 0.19     │
│ 40  │ 36.0     │ 0.0      │ 69.0                │ 0.0      │ 0.2      │
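
The separation between fitting and transforming is what keeps the evaluation out-of-sample: whatever is learned comes from the training rows only and is then applied to new rows. The sketch below illustrates that principle with plain column-mean imputation, which is a much simpler technique than the OptImpute methods and is used here only as a stand-in. The tables and the helper `fill_with` are hypothetical:

```julia
using DataFrames, Statistics

# Made-up training and testing tables with missing entries.
train_toy = DataFrame(x = [1.0, missing, 3.0], y = [10.0, 20.0, missing])
test_toy = DataFrame(x = [missing, 5.0], y = [missing, 40.0])

# "Fit": learn one fill value per column from the training rows only.
fill_values = Dict(name => mean(skipmissing(train_toy[!, name]))
                   for name in names(train_toy))

# "Transform": fill missing entries in a table using the stored values.
function fill_with(df, values)
    out = copy(df)
    for name in names(out)
        out[!, name] = coalesce.(out[!, name], values[name])
    end
    return out
end

test_filled = fill_with(test_toy, fill_values)
```

Here the missing entries of `test_toy` are filled with the training means (2.0 for `x` and 15.0 for `y`), so nothing about the test rows influences the fill values.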

It is common to impute the training set right after fitting the learner, so these two steps can be combined using fit_transform!:

IAI.fit_transform!(lnr, train_df)

92×13 DataFrame. Omitted printing of 8 columns
│ Row │ survival │ alive    │ age_at_heart_attack │ pe       │ fs       │
│     │ Float64? │ Float64? │ Float64?            │ Float64? │ Float64? │
├─────┼──────────┼──────────┼─────────────────────┼──────────┼──────────┤
│ 1   │ 11.0     │ 0.0      │ 71.0                │ 0.0      │ 0.26     │
│ 2   │ 16.0     │ 0.0      │ 55.0                │ 0.0      │ 0.26     │
│ 3   │ 57.0     │ 0.0      │ 60.0                │ 0.0      │ 0.253    │
│ 4   │ 26.0     │ 0.0      │ 68.0                │ 0.0      │ 0.26     │
│ 5   │ 13.0     │ 0.0      │ 62.0                │ 0.0      │ 0.23     │
│ 6   │ 50.0     │ 0.0      │ 60.0                │ 0.0      │ 0.33     │
│ 7   │ 19.0     │ 0.0      │ 46.0                │ 0.0      │ 0.34     │
⋮
│ 85  │ 31.0     │ 0.0      │ 61.0                │ 0.0      │ 0.18     │
│ 86  │ 36.0     │ 0.0      │ 48.0                │ 0.0      │ 0.15     │
│ 87  │ 17.0     │ 0.0      │ 61.8235             │ 0.0      │ 0.09     │
│ 88  │ 21.0     │ 0.0      │ 61.0                │ 0.0      │ 0.14     │
│ 89  │ 7.5      │ 1.0      │ 64.0                │ 0.0      │ 0.24     │
│ 90  │ 41.0     │ 0.0      │ 64.0                │ 0.0      │ 0.28     │
│ 91  │ 22.0     │ 0.0      │ 57.0                │ 0.0      │ 0.14     │
│ 92  │ 20.0     │ 0.0      │ 62.0                │ 0.0      │ 0.15     │

To tune parameters, you can use the standard GridSearch interface. Refer to the documentation on parameter tuning for more information.
