Quick Start Guide: Optimal Imputation

This is a Python version of the corresponding OptImpute quick start guide.

On this page we show examples of how to use the imputation methods of OptImpute on the echocardiogram dataset:

import pandas as pd
df = pd.read_csv(
    "echocardiogram.data",
    na_values="?",
    header=None,
    names=['survival', 'alive', 'age_at_heart_attack', 'pe', 'fs', 'epss',
           'lvdd', 'wm_score', 'wm_index', 'mult', 'name', 'group',
           'alive_at_one'],
)
df['name'] = df['name'].astype('category')
     survival  alive  age_at_heart_attack  ...  name  group  alive_at_one
0        11.0    0.0                 71.0  ...  name    1.0           0.0
1        19.0    0.0                 72.0  ...  name    1.0           0.0
2        16.0    0.0                 55.0  ...  name    1.0           0.0
3        57.0    0.0                 60.0  ...  name    1.0           0.0
4        19.0    1.0                 57.0  ...  name    1.0           0.0
5        26.0    0.0                 68.0  ...  name    1.0           0.0
6        13.0    0.0                 62.0  ...  name    1.0           0.0
..        ...    ...                  ...  ...   ...    ...           ...
125      17.0    0.0                  NaN  ...  name    NaN           NaN
126      21.0    0.0                 61.0  ...  name    NaN           NaN
127       7.5    1.0                 64.0  ...  name    NaN           NaN
128      41.0    0.0                 64.0  ...  name    NaN           NaN
129      36.0    0.0                 69.0  ...  name    NaN           NaN
130      22.0    0.0                 57.0  ...  name    NaN           NaN
131      20.0    0.0                 62.0  ...  name    NaN           NaN

[132 rows x 13 columns]

There are a number of missing values in the dataset; we can inspect the fraction of missing values in each column:

df.isnull().sum() / len(df)
survival               0.015152
alive                  0.007576
age_at_heart_attack    0.037879
pe                     0.007576
fs                     0.060606
epss                   0.113636
lvdd                   0.083333
wm_score               0.030303
wm_index               0.007576
mult                   0.030303
name                   0.000000
group                  0.166667
alive_at_one           0.439394
dtype: float64
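
For instance, to focus on the most affected columns, we can filter these proportions (a small pandas sketch, using an illustrative 10% threshold):

missing = df.isnull().mean()  # same result as df.isnull().sum() / len(df)
missing[missing > 0.1]
epss            0.113636
group           0.166667
alive_at_one    0.439394
dtype: float64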

Learner Interface

OptImpute uses the same learner interface as other IAI packages to allow you to properly conduct imputation on data that has been split into training and testing sets.

In this problem, we will use a survival framework and split the data into training and testing:

from interpretableai import iai
X = df.iloc[:, 2:]
died = [pd.isnull(x) or x == 0 for x in df.iloc[:, 1]]
times = df.iloc[:, 0]
(train_X, train_died, train_times), (test_X, test_died, test_times) = iai.split_data('survival', X, died, times, seed=1)
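
We can quickly check the sizes of the resulting splits (by default, split_data holds out roughly 30% of the data for testing, so from 132 rows we expect about 92 training and 40 testing observations):

len(train_X), len(test_X)
(92, 40)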

First, create a learner and set parameters as normal:

lnr = iai.OptKNNImputationLearner(random_seed=1)
Unfitted OptKNNImputationLearner:
  random_seed: 1

Note that it is also possible to construct the learners with ImputationLearner and the method keyword argument (which can be useful when specifying the method programmatically):

lnr = iai.ImputationLearner(method='opt_knn', random_seed=1)
Unfitted OptKNNImputationLearner:
  random_seed: 1

We can then train the imputation learner on the training dataset with fit:

lnr.fit(train_X)
Fitted OptKNNImputationLearner

The fitted learner can then be used to fill missing values with transform:

lnr.transform(test_X)
    age_at_heart_attack   pe      fs  ...  name  group  alive_at_one
0                  72.0  0.0  0.3800  ...  name    1.0      0.000000
1                  57.0  0.0  0.1600  ...  name    1.0      0.000000
2                  73.0  0.0  0.3300  ...  name    1.0      0.000000
3                  85.0  1.0  0.1800  ...  name    1.0      1.000000
4                  54.0  0.0  0.3000  ...  name    2.0      0.001848
5                  55.0  0.0  0.2315  ...  name    2.0      0.001905
6                  65.0  0.0  0.1500  ...  name    2.0      0.002333
..                  ...  ...     ...  ...   ...    ...           ...
33                 64.0  0.0  0.1500  ...  name    2.0      0.001378
34                 61.0  0.0  0.1800  ...  name    2.0      0.001876
35                 48.0  0.0  0.1500  ...  name    2.0      0.001904
36                 65.5  0.0  0.0900  ...  name    2.0      0.000978
37                 61.0  0.0  0.1400  ...  name    2.0      0.013623
38                 64.0  0.0  0.2400  ...  name    2.0      0.001592
39                 57.0  0.0  0.1400  ...  name    2.0      0.004699

[40 rows x 11 columns]
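
Since the imputation fills every missing entry, we can check that the transformed data contains no remaining missing values:

int(lnr.transform(test_X).isnull().sum().sum())
0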

We commonly want to impute on the training set right after fitting the learner, so these two steps can be combined using fit_transform:

lnr.fit_transform(train_X)
    age_at_heart_attack   pe     fs    epss  ...   mult  name  group  alive_at_one
0             71.000000  0.0  0.260   9.000  ...  1.000  name    1.0      0.000000
1             55.000000  0.0  0.260   4.000  ...  1.000  name    1.0      0.000000
2             60.000000  0.0  0.253  12.062  ...  0.788  name    1.0      0.000000
3             68.000000  0.0  0.260   5.000  ...  0.857  name    1.0      0.000000
4             62.000000  0.0  0.230  31.000  ...  0.857  name    1.0      0.000000
5             60.000000  0.0  0.330   8.000  ...  1.000  name    1.0      0.000000
6             46.000000  0.0  0.340   0.000  ...  1.003  name    1.0      0.000000
..                  ...  ...    ...     ...  ...    ...   ...    ...           ...
85            54.000000  0.0  0.430   9.300  ...  0.714  name    2.0      0.002253
86            57.642857  0.0  0.230  19.100  ...  0.710  name    2.0      0.016485
87            57.000000  1.0  0.120   0.000  ...  0.857  name    2.0      0.000282
88            61.000000  1.0  0.190  13.200  ...  0.786  name    2.0      0.000079
89            64.000000  0.0  0.280   5.400  ...  0.714  name    2.0      0.002328
90            69.000000  0.0  0.200   7.000  ...  0.857  name    2.0      0.002100
91            62.000000  0.0  0.150   0.000  ...  0.786  name    2.0      0.001507

[92 rows x 11 columns]

To tune parameters, you can use the standard GridSearch interface. For more information, refer to the documentation on parameter tuning.
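
For example, here is a minimal sketch of tuning the number of neighbors used by the opt_knn method (the knn_k parameter name and candidate values here are illustrative; refer to the parameter tuning documentation for the parameters supported by each method):

grid = iai.GridSearch(
    iai.ImputationLearner(
        method='opt_knn',
        random_seed=1,
    ),
    # illustrative grid of neighbor counts to evaluate
    knn_k=[5, 10, 15],
)
grid.fit(train_X)
best_lnr = grid.get_learner()  # fitted learner with the best parameters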