Quick Start Guide: Optimal Imputation
This is a Python version of the corresponding OptImpute quick start guide.
On this page we show examples of how to use the imputation methods of OptImpute on the echocardiogram dataset:
import pandas as pd
df = pd.read_csv(
"echocardiogram.data",
na_values="?",
header=None,
names=['survival', 'alive', 'age_at_heart_attack', 'pe', 'fs', 'epss',
'lvdd', 'wm_score', 'wm_index', 'mult', 'name', 'group',
'alive_at_one'],
)
df['name'] = df['name'].astype('category')
survival alive age_at_heart_attack ... name group alive_at_one
0 11.0 0.0 71.0 ... name 1.0 0.0
1 19.0 0.0 72.0 ... name 1.0 0.0
2 16.0 0.0 55.0 ... name 1.0 0.0
3 57.0 0.0 60.0 ... name 1.0 0.0
4 19.0 1.0 57.0 ... name 1.0 0.0
5 26.0 0.0 68.0 ... name 1.0 0.0
6 13.0 0.0 62.0 ... name 1.0 0.0
.. ... ... ... ... ... ... ...
125 17.0 0.0 NaN ... name NaN NaN
126 21.0 0.0 61.0 ... name NaN NaN
127 7.5 1.0 64.0 ... name NaN NaN
128 41.0 0.0 64.0 ... name NaN NaN
129 36.0 0.0 69.0 ... name NaN NaN
130 22.0 0.0 57.0 ... name NaN NaN
131 20.0 0.0 62.0 ... name NaN NaN
[132 rows x 13 columns]
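(If echocardiogram.data is not already in the working directory, one way to obtain it is shown below; the UCI repository path here is an assumption on our part, not part of the original guide.)
import urllib.request
# Assumed UCI Machine Learning Repository location for the dataset
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "echocardiogram/echocardiogram.data")
urllib.request.urlretrieve(url, "echocardiogram.data")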
There are a number of missing values in the dataset; we can check the fraction missing in each column:
df.isnull().sum() / len(df)
survival 0.015152
alive 0.007576
age_at_heart_attack 0.037879
pe 0.007576
fs 0.060606
epss 0.113636
lvdd 0.083333
wm_score 0.030303
wm_index 0.007576
mult 0.030303
name 0.000000
group 0.166667
alive_at_one 0.439394
dtype: float64
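To get a sense of how widespread the missingness is, we can also check the fraction of rows affected (a small addition to the guide):
# Fraction of rows with at least one missing value
df.isnull().any(axis=1).mean()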
Simple Imputation
We can use impute to simply fill the missing values in a DataFrame:
from interpretableai import iai
df_imputed = iai.impute(df, random_seed=1)
survival alive age_at_heart_attack ... name group alive_at_one
0 11.0 0.0 71.000 ... name 1.0 0.000000
1 19.0 0.0 72.000 ... name 1.0 0.000000
2 16.0 0.0 55.000 ... name 1.0 0.000000
3 57.0 0.0 60.000 ... name 1.0 0.000000
4 19.0 1.0 57.000 ... name 1.0 0.000000
5 26.0 0.0 68.000 ... name 1.0 0.000000
6 13.0 0.0 62.000 ... name 1.0 0.000000
.. ... ... ... ... ... ... ...
125 17.0 0.0 61.875 ... name 2.0 0.000187
126 21.0 0.0 61.000 ... name 2.0 0.000154
127 7.5 1.0 64.000 ... name 2.0 0.862109
128 41.0 0.0 64.000 ... name 2.0 0.000179
129 36.0 0.0 69.000 ... name 2.0 0.000150
130 22.0 0.0 57.000 ... name 2.0 0.000149
131 20.0 0.0 62.000 ... name 2.0 0.000158
[132 rows x 13 columns]
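As a quick sanity check (an addition, not part of the original output), we can verify that the imputed DataFrame contains no remaining missing values:
# impute fills every missing cell, so the total missing count should be 0
assert df_imputed.isnull().sum().sum() == 0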
We can control which method is used for imputation by passing the method name:
df_imputed = iai.impute(df, 'opt_tree', random_seed=1)
If you don't know which imputation method or parameter values are best, you can define the grid of parameters to be searched over:
df_imputed = iai.impute(df, {'method': ['opt_knn', 'opt_tree']}, random_seed=1)
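The grid can also mix the method with method-specific parameters. As a sketch (the knn_k parameter name for the number of neighbors is an assumption here; consult the OptImpute parameter documentation for the exact names):
# Hypothetical grid over a method-specific parameter (name assumed)
df_imputed = iai.impute(df, {'method': ['opt_knn'], 'knn_k': [5, 10, 15]},
                        random_seed=1)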
You can also use impute_cv to conduct the search with cross-validation:
df_imputed = iai.impute_cv(df, {'method': ['opt_knn', 'opt_svm']},
random_seed=1)
Learner Interface
You can also use OptImpute through the same learner interface as the other IAI packages. In particular, this is the approach to use when you want to properly conduct an out-of-sample evaluation of performance.
In this problem, we will use a survival framework and split the data into training and testing:
X = df.iloc[:, 2:]  # all features except the survival and alive columns
died = [pd.isnull(x) or x == 0 for x in df['alive']]  # missing status treated as died
times = df['survival']
(train_X, train_died, train_times), (test_X, test_died, test_times) = iai.split_data('survival', X, died, times, seed=1)
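Before fitting, it can be worth confirming the sizes of the resulting splits; consistent with the outputs further below, this leaves 92 training rows and 40 test rows:
# split_data defaults to roughly a 70/30 train/test split
len(train_X), len(test_X)  # (92, 40)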
First, create a learner and set parameters as normal:
lnr = iai.OptKNNImputationLearner(random_seed=1)
Unfitted OptKNNImputationLearner:
random_seed: 1
Note that it is also possible to construct the learners with ImputationLearner and the method keyword argument (which can be useful when specifying the method programmatically):
lnr = iai.ImputationLearner(method='opt_knn', random_seed=1)
Unfitted OptKNNImputationLearner:
random_seed: 1
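For instance (a small illustration added here), the method keyword makes it easy to build a learner for each candidate method:
# Construct one learner per method name, all with the same seed
methods = ['opt_knn', 'opt_svm', 'opt_tree']
learners = [iai.ImputationLearner(method=m, random_seed=1) for m in methods]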
We can then train the imputation learner on the training dataset with fit:
lnr.fit(train_X)
Fitted OptKNNImputationLearner
The fitted learner can then be used to fill missing values with transform:
lnr.transform(test_X)
age_at_heart_attack pe fs epss ... mult name group alive_at_one
0 72.0 0.0 0.3800 6.0 ... 0.588 name 1.0 0.000000
1 57.0 0.0 0.1600 22.0 ... 0.571 name 1.0 0.000000
2 73.0 0.0 0.3300 6.0 ... 1.000 name 1.0 0.000000
3 85.0 1.0 0.1800 19.0 ... 0.710 name 1.0 1.000000
4 64.0 0.0 0.1900 5.9 ... 0.640 name 2.0 0.011187
5 54.0 0.0 0.3000 7.0 ... 0.430 name 2.0 0.017516
6 55.0 0.0 0.2355 7.0 ... 2.000 name 2.0 0.004644
.. ... ... ... ... ... ... ... ... ...
33 57.0 0.0 0.0360 7.0 ... 0.786 name 2.0 0.014034
34 57.0 0.0 0.2900 9.4 ... 0.640 name 2.0 0.007799
35 54.0 0.0 0.4300 9.3 ... 0.714 name 2.0 0.004980
36 64.0 0.0 0.1500 6.6 ... 0.786 name 2.0 0.007343
37 57.0 1.0 0.1200 0.0 ... 0.857 name 1.8 0.000690
38 61.0 1.0 0.1900 13.2 ... 0.786 name 2.0 0.000068
39 69.0 0.0 0.2000 7.0 ... 0.857 name 2.0 0.004542
[40 rows x 11 columns]
We commonly want to impute on the training set right after fitting the learner, so you can combine these two steps using fit_transform:
lnr.fit_transform(train_X)
age_at_heart_attack pe fs epss ... mult name group alive_at_one
0 71.0000 0.0 0.260 9.000 ... 1.000 name 1.0 0.000000
1 55.0000 0.0 0.260 4.000 ... 1.000 name 1.0 0.000000
2 60.0000 0.0 0.253 12.062 ... 0.788 name 1.0 0.000000
3 68.0000 0.0 0.260 5.000 ... 0.857 name 1.0 0.000000
4 62.0000 0.0 0.230 31.000 ... 0.857 name 1.0 0.000000
5 60.0000 0.0 0.330 8.000 ... 1.000 name 1.0 0.000000
6 46.0000 0.0 0.340 0.000 ... 1.003 name 1.0 0.000000
.. ... ... ... ... ... ... ... ... ...
85 48.0000 0.0 0.150 12.000 ... 0.714 name 2.0 0.006803
86 66.4375 0.0 0.090 6.800 ... 0.857 name 2.0 0.009307
87 61.0000 0.0 0.140 25.500 ... 0.786 name 2.0 0.012657
88 64.0000 0.0 0.240 12.900 ... 0.857 name 2.0 0.008295
89 64.0000 0.0 0.280 5.400 ... 0.714 name 2.0 0.010687
90 57.0000 0.0 0.140 16.100 ... 0.786 name 2.0 0.038705
91 62.0000 0.0 0.150 0.000 ... 0.786 name 2.0 0.016973
[92 rows x 11 columns]
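The result is the same as calling fit followed by transform on the training set; a quick sketch of the equivalence (assuming identical seeds produce identical fits):
lnr_a = iai.ImputationLearner(method='opt_knn', random_seed=1)
combined = lnr_a.fit_transform(train_X)

lnr_b = iai.ImputationLearner(method='opt_knn', random_seed=1)
lnr_b.fit(train_X)
separate = lnr_b.transform(train_X)

combined.equals(separate)  # expect True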
To tune parameters, you can use the standard GridSearch interface; refer to the documentation on parameter tuning for more information.
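As a minimal sketch of that interface (the knn_k parameter name is an assumption; see the parameter tuning documentation for the parameters each learner supports):
grid = iai.GridSearch(
    iai.OptKNNImputationLearner(random_seed=1),
    knn_k=[5, 10, 15],  # parameter name assumed
)
grid.fit(train_X)
best_lnr = grid.get_learner()  # learner refit with the best parameters found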