Hepatitis Mortality Prediction
In this example, we consider the Hepatitis dataset where the goal is to predict mortality of patients with hepatitis. As we will see, the dataset has a number of missing values, so we explore and compare many options for handling this missing data while building Optimal Classification Trees for this prediction task.
First, we will load the dataset:
using CSV, DataFrames
df = CSV.read(
    "hepatitis.data", DataFrame,
    missingstring="?",
    header=[:outcome, :age, :sex, :steroid, :antivirals, :fatigue, :malaise,
            :anorexia, :liver_big, :liver_firm, :spleen_palpable, :spiders,
            :ascites, :varices, :bilirubin, :alk_phosphate, :sgot, :albumin,
            :protime, :histology]
)
155×20 DataFrame
 Row │ outcome  age    sex    steroid  antivirals  fatigue  malaise  anorexia  ⋯
     │ Int64    Int64  Int64  Int64?   Int64       Int64?   Int64?   Int64?    ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │       2     30      2        1           2        2        2         2  ⋯
   2 │       2     50      1        1           2        1        2         2
   3 │       2     78      1        2           2        1        2         2
   4 │       2     31      1  missing           1        2        2         2
   5 │       2     34      1        2           2        2        2         2  ⋯
   6 │       2     34      1        2           2        2        2         2
   7 │       1     51      1        1           2        1        2         1
   8 │       2     23      1        2           2        2        2         2
  ⋮  │    ⋮       ⋮      ⋮       ⋮         ⋮          ⋮        ⋮        ⋮      ⋱
 149 │       2     20      1        1           2        2        2         2  ⋯
 150 │       2     36      1        2           2        2        2         2
 151 │       1     46      1        2           2        1        1         1
 152 │       2     44      1        2           2        1        2         2
 153 │       2     61      1        1           2        1        1         2  ⋯
 154 │       2     53      2        1           2        1        2         2
 155 │       1     43      1        2           2        1        2         2
                                                 12 columns and 140 rows omitted
We observe there are a number of missing values:
using Statistics
sort!(DataFrame(num_missing=[sum(ismissing.(col)) for col in eachcol(df)],
                feature=names(df)))
20×2 DataFrame
 Row │ num_missing  feature
     │ Int64        String
─────┼──────────────────────────────
   1 │           0  age
   2 │           0  antivirals
   3 │           0  histology
   4 │           0  outcome
   5 │           0  sex
   6 │           1  anorexia
   7 │           1  fatigue
   8 │           1  malaise
  ⋮  │      ⋮             ⋮
  14 │           5  varices
  15 │           6  bilirubin
  16 │          10  liver_big
  17 │          11  liver_firm
  18 │          16  albumin
  19 │          29  alk_phosphate
  20 │          67  protime
                      5 rows omitted
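To put these counts in perspective, it can help to express them as fractions of the column length; for example, protime is missing in 67 of the 155 rows, roughly 43%. The following is a minimal sketch of that calculation on a small made-up table (the `toy` values are illustrative only and are not taken from the dataset):

```julia
# Toy columns standing in for a few features (values are made up for illustration)
toy = (
    age = [30, 50, 78, 31],
    steroid = [1, 1, 2, missing],
    protime = [missing, 80, missing, 75],
)

# Number of missing entries per column
n_missing = Dict(name => count(ismissing, col) for (name, col) in pairs(toy))

# Fraction of missing entries per column
frac_missing = Dict(name => n / length(toy[name]) for (name, n) in n_missing)
```

The same `count(ismissing, col)` idiom should apply to the columns returned by `eachcol(df)`.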
To prepare for training, we extract the features and target, then randomly split into training and test datasets:
X = df[:, 2:end]
y = [val == 1 ? "Die" : "Live" for val in df.outcome]
seed = 12345
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=seed)
Method 1. Complete cases
Subsetting the data
Complete cases is a simple approach where we drop observations with any missing data for training:
train_X_cc = train_X[completecases(train_X), :]
train_y_cc = train_y[completecases(train_X)]
length(train_y_cc), length(train_y)
(51, 108)
Notice we are left with only 51 observations compared to 108 before subsetting. We have discarded more than half of the training set (57 of 108 observations), so the model has less information to learn from and its performance may suffer.
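To make the row-dropping logic explicit, here is a small sketch of complete-case filtering on a plain matrix, without DataFrames. The data and the names `X_toy`, `y_toy`, and `is_complete` are made up for illustration:

```julia
# Toy feature matrix: rows are observations, columns are features (made-up values)
X_toy = [1.0 2.0 missing;
         4.0 5.0 6.0;
         missing 8.0 9.0;
         10.0 11.0 12.0]
y_toy = ["Live", "Die", "Live", "Live"]

# A row is a complete case when none of its entries is missing
is_complete = [!any(ismissing, row) for row in eachrow(X_toy)]

# Keep the same rows in the features and in the target so they stay aligned
X_cc = X_toy[is_complete, :]
y_cc = y_toy[is_complete]
```

A single missing entry is enough to discard a whole row, which is why the loss of observations grows quickly as more features contain missing values.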
Model Training
Now we train an optimal tree on the complete case data:
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=seed,
        criterion=:gini,
    ),
    max_depth=2:6,
)
IAI.fit!(grid, train_X_cc, train_y_cc, validation_criterion=:auc)
IAI.get_learner(grid)
Evaluate the Model
If we restrict the test set to complete cases as well, we will be left with only 29 observations. However, the imputation-based methods will make predictions on the entire test set of 47 observations, so the results would not be comparable because the methods would not be evaluated on the same set of samples. For that reason, we use mean imputation to fill in the test data, which keeps the same 47 observations when evaluating the complete cases model:
imputer = IAI.ImputationLearner(method=:mean)
IAI.fit!(imputer, train_X)
test_X_imputed = IAI.transform(imputer, test_X)
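The fit-then-transform pattern above can be illustrated with a hand-rolled version. This is only a sketch under the assumption that mean imputation replaces each missing entry with the mean of the observed training values in that column; `fit_means` and `apply_means` are hypothetical helpers, not part of IAI, and the data is made up:

```julia
using Statistics

# "Fit": learn one mean per column from the observed training values only
fit_means(X) = [mean(skipmissing(col)) for col in eachcol(X)]

# "Transform": fill missing entries with the previously learned means
function apply_means(X, means)
    out = Matrix{Float64}(undef, size(X))
    for j in axes(X, 2), i in axes(X, 1)
        out[i, j] = ismissing(X[i, j]) ? means[j] : X[i, j]
    end
    return out
end

# Made-up training and test matrices
train_toy = [1.0 10.0; 3.0 missing; missing 30.0]
test_toy = [missing 5.0; 2.0 missing]

means = fit_means(train_toy)                # learned from training data only
test_filled = apply_means(test_toy, means)  # test rows never influence the means
```

The point of the design is that the test set is filled using statistics learned from the training set alone, so no information from the test set leaks into the imputation.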
We save the evaluation results to a dataframe for later comparison:
results = DataFrame(
    method=:completecase,
    ins_acc=IAI.score(grid, train_X_cc, train_y_cc,
                      criterion=:misclassification),
    oos_acc=IAI.score(grid, test_X_imputed, test_y,
                      criterion=:misclassification),
    ins_auc=IAI.score(grid, train_X_cc, train_y_cc, criterion=:auc),
    oos_auc=IAI.score(grid, test_X_imputed, test_y, criterion=:auc),
)
1×5 DataFrame
 Row │ method        ins_acc   oos_acc   ins_auc   oos_auc
     │ Symbol        Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────
   1 │ completecase  0.941176  0.744681  0.911364  0.585135
Methods 2-4: Using Imputation
Now we are going to explore various imputation methods provided by OptImpute. The pipeline we are going to use has the following steps:
1. Train imputation learner and impute missing values on training set with fit_transform!
2. Impute missing values on test set with transform
3. Train predictive model on imputed training set
4. Evaluate predictive model on imputed training and test sets
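When using imputation before training a predictive model, it is important to conduct the imputation using the process described in Steps 1 and 2 above, so that the imputation learner is fit on the training set only and is then applied unchanged to the test set. For more information, refer to the OptImpute documentation.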
Let's now define a function implementing this pipeline that takes in a parameter specifying the imputation method to use:
function impute_then_predict(method)
    # Step 1: Train imputation learner and impute on training set
    imputer = IAI.ImputationLearner(
        method=method,
        cluster=true,
        cluster_max_size=100,
        random_seed=seed,
    )
    train_X_imputed = IAI.fit_transform!(imputer, train_X)

    # Step 2: Impute on test set
    test_X_imputed = IAI.transform(imputer, test_X)

    # Step 3: Train predictive model on imputed training set
    grid = IAI.GridSearch(
        IAI.OptimalTreeClassifier(
            random_seed=seed,
            criterion=:gini,
        ),
        max_depth=2:6,
    )
    IAI.fit!(grid, train_X_imputed, train_y, validation_criterion=:auc)

    # Step 4: Evaluate predictive model on imputed training and test sets
    append!(results, DataFrame(
        method=method,
        ins_acc=IAI.score(grid, train_X_imputed, train_y,
                          criterion=:misclassification),
        oos_acc=IAI.score(grid, test_X_imputed, test_y,
                          criterion=:misclassification),
        ins_auc=IAI.score(grid, train_X_imputed, train_y, criterion=:auc),
        oos_auc=IAI.score(grid, test_X_imputed, test_y, criterion=:auc)
    ))

    # Return predictive model
    IAI.get_learner(grid)
end
We will look at three imputation methods with this pipeline: random, mean, and optimal kNN imputation.
Method 2. Random imputation
First, we try imputing using a randomly selected value for each variable:
lnr_rand = impute_then_predict(:rand)
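As a rough illustration of the idea, the sketch below fills each missing entry with a value drawn at random from the observed values of the same column. This is one plausible reading of random imputation and is not necessarily how the :rand method is implemented; `impute_random` is a hypothetical helper and the data is made up:

```julia
using Random

# Replace each missing entry with a value drawn at random from the observed
# values of the same column (assumes the column has at least one observed value)
function impute_random(rng, col)
    observed = collect(skipmissing(col))
    return [ismissing(v) ? rand(rng, observed) : v for v in col]
end

rng = MersenneTwister(12345)             # fixed seed for reproducibility
col_toy = [1, 2, missing, 2, missing, 1]
filled = impute_random(rng, col_toy)
```

Observed entries are left untouched, and every filled entry is guaranteed to be a value that actually occurs in the column, but the fill ignores all other features of the observation.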
Method 3. Mean imputation
Next, we impute using the mean for continuous variables and the mode for categorical variables:
lnr_mean = impute_then_predict(:mean)
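The mean/mode rule can be sketched in a few lines. The helpers `observed_mode`, `impute_mean`, and `impute_mode` are hypothetical names for illustration, the tie-breaking rule is an assumption of this sketch, and the toy columns are made up:

```julia
using Statistics

# Most frequent observed value; ties go to the value seen first (an assumption)
function observed_mode(col)
    counts = Dict{Any,Int}()
    for v in skipmissing(col)
        counts[v] = get(counts, v, 0) + 1
    end
    best, best_count = nothing, 0
    for v in skipmissing(col)
        if counts[v] > best_count
            best, best_count = v, counts[v]
        end
    end
    return best
end

# Continuous column: fill with the mean of the observed values
impute_mean(col) = coalesce.(col, mean(skipmissing(col)))

# Categorical column: fill with the most frequent observed value
impute_mode(col) = coalesce.(col, observed_mode(col))

continuous_toy = [1.0, missing, 3.0]
categorical_toy = [2, 2, missing, 1]
```

Every missing entry in a column receives the same value, which preserves the column average but shrinks its variance and ignores relationships between features.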
Method 4. Optimal KNN imputation
Finally, we use globally optimal imputation with k-nearest neighbors, which minimizes the total distance from each point to its nearest neighbors:
lnr_opt_knn = impute_then_predict(:opt_knn)
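For intuition about neighbor-based imputation, here is a deliberately simple single-pass kNN fill. It is not the opt_knn method: it performs no global optimization and does not reproduce the clustering settings used above. The functions `observed_distance` and `knn_impute` are hypothetical, and the sketch assumes numeric features and that every missing entry has at least one other row with that feature observed:

```julia
using Statistics

# Mean squared difference over the features that both rows have observed
function observed_distance(a, b)
    total, used = 0.0, 0
    for (x, y) in zip(a, b)
        if !ismissing(x) && !ismissing(y)
            total += (x - y)^2
            used += 1
        end
    end
    return used == 0 ? Inf : total / used
end

# Fill each missing entry with the mean of that feature over the k nearest
# rows that have the feature observed (a one-pass heuristic, no optimization)
function knn_impute(X, k)
    out = Matrix{Float64}(undef, size(X))
    for i in axes(X, 1), j in axes(X, 2)
        if !ismissing(X[i, j])
            out[i, j] = X[i, j]
            continue
        end
        # Candidate donors: other rows where feature j is observed
        donors = [r for r in axes(X, 1) if r != i && !ismissing(X[r, j])]
        sort!(donors, by = r -> observed_distance(X[i, :], X[r, :]))
        nearest = donors[1:min(k, length(donors))]
        out[i, j] = mean(X[r, j] for r in nearest)
    end
    return out
end

# Made-up data: row 2 is close to row 1 and far from rows 3 and 4
X_toy = [1.0 1.0; 1.1 missing; 5.0 5.0; 5.2 5.1]
X_filled = knn_impute(X_toy, 1)
```

Unlike mean imputation, the filled value depends on the other features of the observation, so similar patients receive similar imputed values.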
Results
We can inspect the comparison results:
results
4×5 DataFrame
 Row │ method        ins_acc   oos_acc   ins_auc   oos_auc
     │ Symbol        Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────
   1 │ completecase  0.941176  0.744681  0.911364  0.585135
   2 │ rand          0.953704  0.787234  0.893499  0.644595
   3 │ mean          0.962963  0.723404  0.967495  0.706757
   4 │ opt_knn       0.962963  0.765957  0.94186   0.744595
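Notice that opt_knn has the highest out-of-sample AUC (0.745) compared to the more naive approaches, and more generally, as we increase the quality of the imputation, the out-of-sample AUC improves. It also shows the least overfitting, with the smallest gap between in-sample and out-of-sample AUC. Out-of-sample accuracy is less clear-cut: random imputation scores highest on that measure (0.787, compared to 0.766 for opt_knn).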
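One simple way to quantify overfitting here is the gap between in-sample and out-of-sample AUC. The sketch below recomputes it from the values in the results table above; the variable names are illustrative:

```julia
# In-sample and out-of-sample AUC copied from the results table above
auc = [
    (method = :completecase, ins = 0.911364, oos = 0.585135),
    (method = :rand,         ins = 0.893499, oos = 0.644595),
    (method = :mean,         ins = 0.967495, oos = 0.706757),
    (method = :opt_knn,      ins = 0.94186,  oos = 0.744595),
]

# Gap between in-sample and out-of-sample AUC; a larger gap suggests more overfitting
gaps = [(r.method, round(r.ins - r.oos, digits = 3)) for r in auc]

# Method with the best out-of-sample AUC, and method with the smallest gap
best_oos = auc[argmax([r.oos for r in auc])].method
smallest_gap = auc[argmin([r.ins - r.oos for r in auc])].method
```

The gaps come out at roughly 0.326 for complete cases, 0.249 for random, 0.261 for mean, and 0.197 for opt_knn, so opt_knn has both the best out-of-sample AUC and the smallest gap.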
We can also see from the trees that after imputing the missing values with opt_knn, the validated tree uses features that had high levels of missingness in the original data. Coupled with the good out-of-sample results, this indicates that the imputations are enabling the tree to extract more meaningful signals from the important features in the data.
Together, these results give clear evidence that the quality of the imputations can have a significant impact on the performance of the final predictive model.