Quick Start Guide: Optimal Imputation

This is a Python version of the corresponding OptImpute quick start guide.

On this page we show examples of how to use the imputation methods of OptImpute on the echocardiogram dataset:

import pandas as pd
# Read the echocardiogram dataset, treating '?' entries as missing values
df = pd.read_csv(
    "echocardiogram.data",
    na_values="?",
    header=None,
    names=['survival', 'alive', 'age_at_heart_attack', 'pe', 'fs', 'epss',
           'lvdd', 'wm_score', 'wm_index', 'mult', 'name', 'group',
           'alive_at_one'],
)
# Treat the name column as categorical rather than free text
df['name'] = df['name'].astype('category')
     survival  alive  age_at_heart_attack  ...  name  group  alive_at_one
0        11.0    0.0                 71.0  ...  name    1.0           0.0
1        19.0    0.0                 72.0  ...  name    1.0           0.0
2        16.0    0.0                 55.0  ...  name    1.0           0.0
3        57.0    0.0                 60.0  ...  name    1.0           0.0
4        19.0    1.0                 57.0  ...  name    1.0           0.0
5        26.0    0.0                 68.0  ...  name    1.0           0.0
6        13.0    0.0                 62.0  ...  name    1.0           0.0
..        ...    ...                  ...  ...   ...    ...           ...
125      17.0    0.0                  NaN  ...  name    NaN           NaN
126      21.0    0.0                 61.0  ...  name    NaN           NaN
127       7.5    1.0                 64.0  ...  name    NaN           NaN
128      41.0    0.0                 64.0  ...  name    NaN           NaN
129      36.0    0.0                 69.0  ...  name    NaN           NaN
130      22.0    0.0                 57.0  ...  name    NaN           NaN
131      20.0    0.0                 62.0  ...  name    NaN           NaN

[132 rows x 13 columns]

There are a number of missing values in the dataset:

df.isnull().sum() / len(df)
survival               0.015152
alive                  0.007576
age_at_heart_attack    0.037879
pe                     0.007576
fs                     0.060606
epss                   0.113636
lvdd                   0.083333
wm_score               0.030303
wm_index               0.007576
mult                   0.030303
name                   0.000000
group                  0.166667
alive_at_one           0.439394
dtype: float64
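
The output above is a pandas Series, so standard pandas operations can be used to focus on the most affected columns (a quick pandas aside, not part of the OptImpute API; the 10% threshold is just for illustration):

missing_frac = df.isnull().sum() / len(df)
# Columns where more than 10% of entries are missing, worst first
missing_frac[missing_frac > 0.1].sort_values(ascending=False)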

Simple Imputation

We can use impute to simply fill the missing values in a DataFrame:

from interpretableai import iai
df_imputed = iai.impute(df)
     survival  alive  age_at_heart_attack  ...  name  group  alive_at_one
0        11.0    0.0               71.000  ...  name    1.0      0.000000
1        19.0    0.0               72.000  ...  name    1.0      0.000000
2        16.0    0.0               55.000  ...  name    1.0      0.000000
3        57.0    0.0               60.000  ...  name    1.0      0.000000
4        19.0    1.0               57.000  ...  name    1.0      0.000000
5        26.0    0.0               68.000  ...  name    1.0      0.000000
6        13.0    0.0               62.000  ...  name    1.0      0.000000
..        ...    ...                  ...  ...   ...    ...           ...
125      17.0    0.0               61.875  ...  name    2.0      0.000004
126      21.0    0.0               61.000  ...  name    2.0      0.000003
127       7.5    1.0               64.000  ...  name    2.0      0.893597
128      41.0    0.0               64.000  ...  name    2.0      0.000004
129      36.0    0.0               69.000  ...  name    2.0      0.000003
130      22.0    0.0               57.000  ...  name    2.0      0.000003
131      20.0    0.0               62.000  ...  name    2.0      0.000003

[132 rows x 13 columns]
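
To confirm that every missing entry was filled, we can run a quick pandas sanity check (independent of the OptImpute API):

# There should be no NaN values anywhere in the imputed DataFrame
assert not df_imputed.isnull().any().any()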

We can control the method to use for imputation by passing the method name:

df_imputed = iai.impute(df, 'opt_svm')

If you don't know which imputation method or parameter values are best, you can define a grid of parameters to search over:

df_imputed = iai.impute(df, {'method': ['opt_knn', 'opt_svm']})

You can also use impute_cv to conduct the search with cross-validation:

df_imputed = iai.impute_cv(df, {'method': ['opt_knn', 'opt_svm']})
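
The grid can also combine the method choice with method-specific parameters. As a sketch, assuming knn_k (the number of neighbors used by opt_knn) is a supported parameter in your version (see the parameter reference):

# Cross-validate over the number of neighbors used by opt_knn
df_imputed = iai.impute_cv(df, {
    'method': ['opt_knn'],
    'knn_k': [5, 10, 15],
})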

Learner Interface

You can also use OptImpute through the same learner interface as other IAI packages. In particular, this is the approach to use when you want to properly conduct an out-of-sample evaluation of performance.

We will split the data into training and testing:

(train_df,), (test_df,) = iai.split_data('imputation', df, seed=1)
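
As a quick check of the split (with this seed we get the 92-row training set and 40-row test set shown in the outputs below):

# Inspect the sizes of the two splits
print(len(train_df), len(test_df))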

First, create a learner and set parameters as normal:

lnr = iai.OptKNNImputationLearner(random_seed=1)
Unfitted OptKNNImputationLearner:
  random_seed: 1

Note that it is also possible to construct the learners with ImputationLearner and the method keyword argument (which can be useful when specifying the method programmatically):

lnr = iai.ImputationLearner(method='opt_knn', random_seed=1)
Unfitted OptKNNImputationLearner:
  random_seed: 1
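
For example, if the method is chosen at runtime from a configuration value, the keyword form avoids mapping strings to learner classes by hand (the config dict here is hypothetical):

config = {'imputation_method': 'opt_knn'}  # hypothetical application config
lnr = iai.ImputationLearner(method=config['imputation_method'], random_seed=1)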

We can then train the imputation learner on the training dataset with fit:

lnr.fit(train_df)
Fitted OptKNNImputationLearner

The fitted learner can then be used to fill missing values with transform:

lnr.transform(test_df)
    survival  alive  age_at_heart_attack   pe  ...   mult  name  group  alive_at_one
0      26.00    0.0                 68.0  0.0  ...  0.857  name    1.0      0.000000
1      25.00    0.0                 54.0  0.0  ...  0.930  name    1.0      0.000000
2      10.00    1.0                 77.0  0.0  ...  0.714  name    1.0      1.000000
3      52.00    0.0                 73.0  0.0  ...  1.000  name    1.0      0.000000
4      44.00    0.0                 60.0  0.0  ...  1.000  name    1.0      0.000000
5       0.50    1.0                 69.0  1.0  ...  0.784  name    1.0      1.000000
6       0.75    1.0                 69.0  0.0  ...  0.857  name    1.0      1.000000
..       ...    ...                  ...  ...  ...    ...   ...    ...           ...
33     31.00    0.0                 53.0  0.0  ...  0.710  name    2.0      0.000087
34     25.00    0.0                 62.0  0.0  ...  0.786  name    2.0      0.000083
35     25.00    0.0                 57.0  0.0  ...  0.786  name    2.0      0.000000
36      3.00    1.0                 62.0  0.0  ...  1.000  name    2.0      1.000000
37     25.00    0.0                 59.0  1.0  ...  0.857  name    1.8      0.000040
38     36.00    0.0                 48.0  0.0  ...  0.714  name    2.0      0.000061
39     22.00    0.0                 57.0  0.0  ...  0.786  name    2.0      0.000084

[40 rows x 13 columns]
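
As before, we can verify with pandas that no missing values remain in the transformed test set:

test_imputed = lnr.transform(test_df)
# Every entry should now be filled
assert not test_imputed.isnull().any().any()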

We commonly want to impute on the training set right after fitting the learner, so these two steps can be combined using fit_transform:

lnr.fit_transform(train_df)
    survival  alive  age_at_heart_attack   pe  ...   mult  name  group  alive_at_one
0       11.0    0.0            71.000000  0.0  ...  1.000  name    1.0      0.000000
1       19.0    0.0            72.000000  0.0  ...  0.588  name    1.0      0.000000
2       16.0    0.0            55.000000  0.0  ...  1.000  name    1.0      0.000000
3       57.0    0.0            60.000000  0.0  ...  0.788  name    1.0      0.000000
4       19.0    1.0            57.000000  0.0  ...  0.571  name    1.0      0.000000
5       13.0    0.0            62.000000  0.0  ...  0.857  name    1.0      0.000000
6       50.0    0.0            60.000000  0.0  ...  1.000  name    1.0      0.000000
..       ...    ...                  ...  ...  ...    ...   ...    ...           ...
85      12.0    0.0            61.000000  1.0  ...  0.786  name    2.0      0.000064
86      17.0    0.0            62.705882  0.0  ...  0.857  name    2.0      0.000115
87      21.0    0.0            61.000000  0.0  ...  0.786  name    2.0      0.000074
88       7.5    1.0            64.000000  0.0  ...  0.857  name    2.0      0.999997
89      41.0    0.0            64.000000  0.0  ...  0.714  name    2.0      0.000091
90      36.0    0.0            69.000000  0.0  ...  0.857  name    2.0      0.000095
91      20.0    0.0            62.000000  0.0  ...  0.786  name    2.0      0.000110

[92 rows x 13 columns]
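
This should give the same result as the two-step version above (assuming, as with other IAI learners, that results are deterministic for a fixed random_seed):

# Equivalent to fit_transform on the same data
lnr.fit(train_df)
train_imputed = lnr.transform(train_df)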

To tune parameters, you can use the standard GridSearch interface; refer to the documentation on parameter tuning for more information.
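
As a sketch of what such a search might look like (assuming GridSearch accepts imputation learners and that knn_k is a tunable parameter of opt_knn; consult the parameter tuning documentation for the exact interface):

grid = iai.GridSearch(
    iai.ImputationLearner(method='opt_knn', random_seed=1),
    knn_k=[5, 10, 15],
)
grid.fit_cv(train_df)
lnr_best = grid.get_learner()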