Hepatitis Mortality Prediction

In this example, we consider the Hepatitis dataset, where the goal is to predict the mortality of patients with hepatitis. As we will see, the dataset has a number of missing values, so we compare four approaches to handling this missing data while building Optimal Classification Trees for the prediction task.

First, we will load the dataset:

using CSV, DataFrames
df = CSV.read(
    "hepatitis.data", DataFrame,
    missingstring="?",
    header=[:outcome, :age, :sex, :steroid, :antivirals, :fatigue, :malaise,
            :anorexia, :liver_big, :liver_firm, :spleen_palpable, :spiders,
            :ascites, :varices, :bilirubin, :alk_phosphate, :sgot, :albumin,
            :protime, :histology]
)
155×20 DataFrame
 Row │ outcome  age    sex    steroid  antivirals  fatigue  malaise  anorexia  ⋯
     │ Int64    Int64  Int64  Int64?   Int64       Int64?   Int64?   Int64?    ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │       2     30      2        1           2        2        2         2  ⋯
   2 │       2     50      1        1           2        1        2         2
   3 │       2     78      1        2           2        1        2         2
   4 │       2     31      1  missing           1        2        2         2
   5 │       2     34      1        2           2        2        2         2  ⋯
   6 │       2     34      1        2           2        2        2         2
   7 │       1     51      1        1           2        1        2         1
   8 │       2     23      1        2           2        2        2         2
  ⋮  │    ⋮       ⋮      ⋮       ⋮         ⋮          ⋮        ⋮        ⋮      ⋱
 149 │       2     20      1        1           2        2        2         2  ⋯
 150 │       2     36      1        2           2        2        2         2
 151 │       1     46      1        2           2        1        1         1
 152 │       2     44      1        2           2        1        2         2
 153 │       2     61      1        1           2        1        1         2  ⋯
 154 │       2     53      2        1           2        1        2         2
 155 │       1     43      1        2           2        1        2         2
                                                 12 columns and 140 rows omitted

We observe there are a number of missing values:

using Statistics
sort!(DataFrame(num_missing=[sum(ismissing.(col)) for col in eachcol(df)],
                feature=names(df)))
20×2 DataFrame
 Row │ num_missing  feature
     │ Int64        String
─────┼──────────────────────────────
   1 │           0  age
   2 │           0  antivirals
   3 │           0  histology
   4 │           0  outcome
   5 │           0  sex
   6 │           1  anorexia
   7 │           1  fatigue
   8 │           1  malaise
  ⋮  │      ⋮              ⋮
  14 │           5  varices
  15 │           6  bilirubin
  16 │          10  liver_big
  17 │          11  liver_firm
  18 │          16  albumin
  19 │          29  alk_phosphate
  20 │          67  protime
                      5 rows omitted

To prepare for training, we extract the features and target, then randomly split into training and test datasets:

X = df[:, 2:end]
y = [val == 1 ? "Die" : "Live" for val in df.outcome]
seed = 12345
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=seed)

Method 1. Complete cases

Subsetting the data

The complete-case approach simply drops every training observation that has any missing value:

train_X_cc = train_X[completecases(train_X), :]
train_y_cc = train_y[completecases(train_X)]
length(train_y_cc), length(train_y)
(51, 108)

Notice we are left with only 51 of the original 108 training observations. Dropping more than half of the training set discards information, and model performance may suffer as a result.
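The size of this loss can be surprising, because missingness that is modest in each column adds up across rows. The following is a minimal sketch on a hypothetical toy table (the column names `a`, `b`, `c` and all values are invented for illustration), using the same `completecases` function from DataFrames:

```julia
using DataFrames

# Hypothetical toy data: each column is missing exactly one of six values,
# and the missing entries fall in different rows
toy = DataFrame(
    a = [1, missing, 3, 4, 5, 6],
    b = [1.0, 2.0, missing, 4.0, 5.0, 6.0],
    c = ["x", "y", "z", missing, "x", "y"],
)

# Fraction of missing entries in each column
col_missing = [count(ismissing, col) / nrow(toy) for col in eachcol(toy)]

# Boolean mask of rows with no missing value in any column
keep = completecases(toy)
toy_cc = toy[keep, :]

println("per-column missing fraction: ", col_missing)
println("rows kept: ", nrow(toy_cc), " of ", nrow(toy))
```

In this toy table each column is only one-sixth missing, yet half of the rows are dropped because the gaps do not overlap.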

Model Training

Now we train an Optimal Classification Tree on the complete-case data:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=seed,
        criterion=:gini,
    ),
    max_depth=2:6,
)
IAI.fit!(grid, train_X_cc, train_y_cc, validation_criterion=:auc)
IAI.get_learner(grid)
Optimal Trees Visualization

Evaluate the Model

If we restricted the test set to complete cases as well, we would be left with only 29 observations. The imputation-based methods, however, make predictions on the entire test set of 47 observations, so the methods would not be evaluated on the same set of samples and their performance could not be compared. For that reason, we use mean imputation to fill in the test data, so that the complete-case model is also evaluated on all 47 observations:

imputer = IAI.ImputationLearner(method=:mean)
IAI.fit!(imputer, train_X)
test_X_imputed = IAI.transform(imputer, test_X)

We save the evaluation results to a DataFrame for later comparison:

results = DataFrame(
    method=:completecase,
    ins_acc=IAI.score(grid, train_X_cc, train_y_cc,
                      criterion=:misclassification),
    oos_acc=IAI.score(grid, test_X_imputed, test_y,
                      criterion=:misclassification),
    ins_auc=IAI.score(grid, train_X_cc, train_y_cc, criterion=:auc),
    oos_auc=IAI.score(grid, test_X_imputed, test_y, criterion=:auc),
)
1×5 DataFrame
 Row │ method        ins_acc   oos_acc   ins_auc   oos_auc
     │ Symbol        Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────
   1 │ completecase  0.941176  0.744681  0.911364  0.585135
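For readers less familiar with AUC, it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes that pairwise estimate directly. It is an illustration only, not IAI's implementation of the `:auc` criterion; the function name `pairwise_auc` and the scores are hypothetical.

```julia
# Pairwise (Mann-Whitney) estimate of AUC: the fraction of positive/negative
# pairs in which the positive case gets the higher score; ties count as 1/2.
function pairwise_auc(scores::AbstractVector{<:Real},
                      is_positive::AbstractVector{Bool})
    pos = scores[is_positive]
    neg = scores[.!is_positive]
    (isempty(pos) || isempty(neg)) && throw(ArgumentError("need both classes"))
    total = 0.0
    for p in pos, n in neg
        total += p > n ? 1.0 : (p == n ? 0.5 : 0.0)
    end
    return total / (length(pos) * length(neg))
end

# Hypothetical predicted probabilities of "Die" for six patients
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
died   = [true, true, true, false, false, false]

auc = pairwise_auc(scores, died)
println("AUC = ", auc)
```

Here eight of the nine positive/negative pairs are ordered correctly, so the estimate is 8/9. A value near 0.5, such as the out-of-sample AUC of the complete-case model, means the ranking is only slightly better than chance.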

Methods 2-4: Using Imputation

Now we are going to explore various imputation methods provided by OptImpute. The pipeline we are going to use has the following steps:

  1. Train imputation learner and impute missing values on training set with fit_transform!
  2. Impute missing values on test set with transform
  3. Train predictive model on imputed training set
  4. Evaluate predictive model on imputed training and test sets

Info

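When using imputation before training a predictive model, it is important to follow the process in Steps 1 and 2 above: fit the imputation learner on the training set only, then apply the fitted learner to the test set, so that no information from the test set influences training. For more information, refer to the OptImpute documentation.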
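To illustrate the separation between Steps 1 and 2, here is a minimal sketch that uses plain column means as a stand-in for an imputation learner. The helper `fill_with` and the data are hypothetical; the point is only that the fill values are computed from the training columns and then reused unchanged on the test columns.

```julia
using Statistics

# Hypothetical data: two numeric features, stored column by column
train_cols = [[1.0, missing, 3.0], [10.0, 20.0, missing]]
test_cols  = [[missing, 100.0],    [missing, 40.0]]

# Step 1 ("fit"): learn one fill value per column from the training data only
train_means = [mean(skipmissing(c)) for c in train_cols]

# Replace missing entries in each column with the supplied fill value
fill_with(cols, values) = [coalesce.(c, v) for (c, v) in zip(cols, values)]

train_filled = fill_with(train_cols, train_means)

# Step 2 ("transform"): reuse the training means on the test data
test_filled = fill_with(test_cols, train_means)

println(train_means)
println(test_filled)
```

The missing test value in the first column is filled with the training mean of 2.0, not with anything derived from the test value of 100.0, so the test set plays no part in fitting.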

Let's now define a function implementing this pipeline that takes in a parameter specifying the imputation method to use:

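# Runs the full pipeline for one imputation method. Uses the globals
# `train_X`, `train_y`, `test_X`, `test_y`, and `seed` defined above, and
# appends one row of scores to the global `results` table.
function impute_then_predict(method)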
  # Step 1: Train imputation learner and impute on training set
  imputer = IAI.ImputationLearner(
      method=method,
      cluster=true,
      cluster_max_size=100,
      random_seed=seed,
  )
  train_X_imputed = IAI.fit_transform!(imputer, train_X)

  # Step 2: Impute on test set
  test_X_imputed = IAI.transform(imputer, test_X)

  # Step 3: Train predictive model on imputed training set
  grid = IAI.GridSearch(
      IAI.OptimalTreeClassifier(
          random_seed=seed,
          criterion=:gini,
      ),
      max_depth=2:6,
  )
  IAI.fit!(grid, train_X_imputed, train_y, validation_criterion=:auc)

  # Step 4: Evaluate predictive model on imputed training and test sets
  append!(results, DataFrame(
      method=method,
      ins_acc=IAI.score(grid, train_X_imputed, train_y,
                        criterion=:misclassification),
      oos_acc=IAI.score(grid, test_X_imputed, test_y,
                        criterion=:misclassification),
      ins_auc=IAI.score(grid, train_X_imputed, train_y, criterion=:auc),
      oos_auc=IAI.score(grid, test_X_imputed, test_y, criterion=:auc),
  ))

  # Return predictive model
  IAI.get_learner(grid)
end

We will look at three imputation methods with this pipeline: random imputation, mean imputation, and optimal imputation using k-nearest neighbors (KNN).

Method 2. Random imputation

First, we try imputing each missing entry with a randomly selected value for that variable:

lnr_rand = impute_then_predict(:rand)
Optimal Trees Visualization
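As a rough illustration of the idea, the sketch below fills each missing entry with a value drawn uniformly from the observed entries of the same column. This is one plausible reading of random imputation, not a description of how the `:rand` method is implemented; the function `random_fill` and the values are hypothetical.

```julia
using Random

# Replace each missing entry with a value drawn uniformly at random from the
# observed entries of the same column. Passing `rng` makes the draw repeatable.
function random_fill(col::AbstractVector, rng::AbstractRNG)
    observed = collect(skipmissing(col))
    isempty(observed) && throw(ArgumentError("column has no observed values"))
    return [ismissing(x) ? rand(rng, observed) : x for x in col]
end

rng = MersenneTwister(12345)
bilirubin = [1.0, missing, 0.7, 2.3, missing]   # hypothetical values

filled = random_fill(bilirubin, rng)
println(filled)
```

Observed entries are left untouched, and every filled entry is one of the values already present in the column, so the marginal distribution is roughly preserved while relationships between variables are ignored.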

Method 3. Mean imputation

Next, we impute using the mean for continuous variables and the mode for categorical variables:

lnr_mean = impute_then_predict(:mean)
Optimal Trees Visualization
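The mean/mode rule can be sketched in a few lines. The code below is an illustration under a simplifying assumption, namely that numeric columns are continuous and all other columns are categorical; the helpers `observed_mode`, `fill_value`, and `impute_simple` and the data are hypothetical and are not part of the IAI API.

```julia
using Statistics

# Most frequent observed value; on ties, the value that reaches the
# highest count first while scanning the column wins
function observed_mode(col)
    counts = Dict{Any,Int}()
    best, best_count = nothing, 0
    for x in skipmissing(col)
        counts[x] = get(counts, x, 0) + 1
        if counts[x] > best_count
            best, best_count = x, counts[x]
        end
    end
    return best
end

# Assumption: numeric columns are continuous, everything else is categorical
function fill_value(col)
    T = nonmissingtype(eltype(col))
    return T <: Real ? mean(skipmissing(col)) : observed_mode(col)
end

impute_simple(col) = coalesce.(col, fill_value(col))

albumin = [4.0, missing, 3.5, 4.5]         # hypothetical continuous feature
ascites = ["no", "no", missing, "yes"]     # hypothetical categorical feature

println(impute_simple(albumin))
println(impute_simple(ascites))
```

Every missing entry in a column receives the same value, which is simple and stable but shrinks the variance of the imputed feature.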

Method 4. Optimal KNN imputation

Finally, we use globally optimal imputation with k-nearest neighbors, which chooses the imputed values to minimize the total distance from each point to its nearest neighbors:

lnr_opt_knn = impute_then_predict(:opt_knn)
Optimal Trees Visualization
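For intuition about the nearest-neighbor idea, here is a sketch of a plain, single-pass KNN imputation. It is deliberately not the `:opt_knn` method: it performs no global optimization and simply borrows the average of the k closest rows. The functions `partial_distance` and `knn_fill`, the unscaled distance, and the data are all assumptions made for illustration.

```julia
using Statistics

# Mean squared difference over the features observed in both rows;
# Inf when the two rows share no observed feature
function partial_distance(a, b)
    total, n = 0.0, 0
    for (x, y) in zip(a, b)
        if !ismissing(x) && !ismissing(y)
            total += (x - y)^2
            n += 1
        end
    end
    return n == 0 ? Inf : total / n
end

# Fill each missing entry with the mean of that feature over the k nearest
# rows in which the feature is observed. Distances use the original data.
function knn_fill(X::AbstractMatrix, k::Int)
    filled = copy(X)
    n = size(X, 1)
    for i in 1:n, j in axes(X, 2)
        ismissing(X[i, j]) || continue
        donors = [r for r in 1:n if r != i && !ismissing(X[r, j])]
        dists = [partial_distance(X[i, :], X[r, :]) for r in donors]
        usable = findall(isfinite, dists)
        isempty(usable) && continue        # no donor available; stays missing
        order = sortperm(dists[usable])
        nearest = donors[usable[order[1:min(k, length(order))]]]
        filled[i, j] = mean(X[r, j] for r in nearest)
    end
    return filled
end

# Hypothetical data: two loose groups of rows, with two missing entries
X = Union{Missing,Float64}[
    1.0  2.0      10.0
    1.1  2.1      missing
    5.0  6.0      50.0
    5.2  missing  52.0
    0.9  1.9      11.0
    4.8  5.8      49.0
]

filled = knn_fill(X, 2)
println(filled)
```

Each missing entry is filled from rows in its own group rather than from the overall column mean. In practice the features would usually be standardized first so that no single feature dominates the distance.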

Results

We can inspect the comparison results:

results
4×5 DataFrame
 Row │ method        ins_acc   oos_acc   ins_auc   oos_auc
     │ Symbol        Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────
   1 │ completecase  0.941176  0.744681  0.911364  0.585135
   2 │ rand          0.953704  0.787234  0.893499  0.644595
   3 │ mean          0.962963  0.723404  0.967495  0.706757
   4 │ opt_knn       0.962963  0.765957  0.94186   0.744595

Notice that opt_knn has the highest out-of-sample AUC (0.745), although random imputation has the highest out-of-sample accuracy on this split. More generally, out-of-sample AUC rises as the quality of the imputation increases, and the gap between in-sample and out-of-sample AUC is smallest for opt_knn, which suggests less overfitting.
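The comparison can be made concrete by computing the gap between in-sample and out-of-sample AUC for each method. The sketch below copies the numbers from the results table above into plain vectors; the variable names are illustrative.

```julia
# Values copied from the results table above
method_names = ["completecase", "rand", "mean", "opt_knn"]
oos_acc = [0.744681, 0.787234, 0.723404, 0.765957]
ins_auc = [0.911364, 0.893499, 0.967495, 0.94186]
oos_auc = [0.585135, 0.644595, 0.706757, 0.744595]

# In-sample minus out-of-sample AUC: a larger gap suggests more overfitting
auc_gap = round.(ins_auc .- oos_auc, digits=3)

for (name, gap) in zip(method_names, auc_gap)
    println(rpad(name, 14), gap)
end

println("best out-of-sample AUC:      ", method_names[argmax(oos_auc)])
println("best out-of-sample accuracy: ", method_names[argmax(oos_acc)])
println("smallest AUC gap:            ", method_names[argmin(auc_gap)])
```

The gaps come out at roughly 0.326, 0.249, 0.261, and 0.197, so opt_knn has both the best out-of-sample AUC and the smallest gap, while rand has the best out-of-sample accuracy.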

We can also see from the trees that after imputing the missing values with opt_knn, the validated tree uses features that had high levels of missingness in the original data. Coupled with the good out-of-sample results, this indicates that the imputations are enabling the tree to extract more meaningful signals from the important features in the data.

Together, these results give clear evidence that the quality of the imputations can have a significant impact on the performance of the final predictive model.