Hepatitis Mortality Prediction

In this example, we consider the Hepatitis dataset where the goal is to predict mortality of patients with hepatitis. As we will see, the dataset has a number of missing values, so we explore and compare many options for handling this missing data while building Optimal Classification Trees for this prediction task.

First, we will load the dataset:

using CSV
df = CSV.read(
    "hepatitis.data",
    missingstring="?",
    header=[:outcome, :age, :sex, :steroid, :antivirals, :fatigue, :malaise,
            :anorexia, :liver_big, :liver_firm, :spleen_palpable, :spiders,
            :ascites, :varices, :bilirubin, :alk_phosphate, :sgot, :albumin,
            :protime, :histology]
)
155×20 DataFrames.DataFrame. Omitted printing of 13 columns
│ Row │ outcome │ age   │ sex   │ steroid │ antivirals │ fatigue │ malaise │
│     │ Int64   │ Int64 │ Int64 │ Int64⍰  │ Int64      │ Int64⍰  │ Int64⍰  │
├─────┼─────────┼───────┼───────┼─────────┼────────────┼─────────┼─────────┤
│ 1   │ 2       │ 30    │ 2     │ 1       │ 2          │ 2       │ 2       │
│ 2   │ 2       │ 50    │ 1     │ 1       │ 2          │ 1       │ 2       │
│ 3   │ 2       │ 78    │ 1     │ 2       │ 2          │ 1       │ 2       │
│ 4   │ 2       │ 31    │ 1     │ missing │ 1          │ 2       │ 2       │
│ 5   │ 2       │ 34    │ 1     │ 2       │ 2          │ 2       │ 2       │
│ 6   │ 2       │ 34    │ 1     │ 2       │ 2          │ 2       │ 2       │
│ 7   │ 1       │ 51    │ 1     │ 1       │ 2          │ 1       │ 2       │
⋮
│ 148 │ 1       │ 70    │ 1     │ 1       │ 2          │ 1       │ 1       │
│ 149 │ 2       │ 20    │ 1     │ 1       │ 2          │ 2       │ 2       │
│ 150 │ 2       │ 36    │ 1     │ 2       │ 2          │ 2       │ 2       │
│ 151 │ 1       │ 46    │ 1     │ 2       │ 2          │ 1       │ 1       │
│ 152 │ 2       │ 44    │ 1     │ 2       │ 2          │ 1       │ 2       │
│ 153 │ 2       │ 61    │ 1     │ 1       │ 2          │ 1       │ 1       │
│ 154 │ 2       │ 53    │ 2     │ 1       │ 2          │ 1       │ 2       │
│ 155 │ 1       │ 43    │ 1     │ 2       │ 2          │ 1       │ 2       │

We observe there are a number of missing values:

using DataFrames, Statistics
DataFrame(col=names(df),
          missing_fraction=[mean(ismissing.(col)) for col in eachcol(df)])
20×2 DataFrames.DataFrame
│ Row │ col           │ missing_fraction │
│     │ Symbol        │ Float64          │
├─────┼───────────────┼──────────────────┤
│ 1   │ outcome       │ 0.0              │
│ 2   │ age           │ 0.0              │
│ 3   │ sex           │ 0.0              │
│ 4   │ steroid       │ 0.00645161       │
│ 5   │ antivirals    │ 0.0              │
│ 6   │ fatigue       │ 0.00645161       │
│ 7   │ malaise       │ 0.00645161       │
⋮
│ 13  │ ascites       │ 0.0322581        │
│ 14  │ varices       │ 0.0322581        │
│ 15  │ bilirubin     │ 0.0387097        │
│ 16  │ alk_phosphate │ 0.187097         │
│ 17  │ sgot          │ 0.0258065        │
│ 18  │ albumin       │ 0.103226         │
│ 19  │ protime       │ 0.432258         │
│ 20  │ histology     │ 0.0              │

To prepare for training, we extract the features and target, then randomly split into training and test datasets:

X = df[:, 2:end]
y = [val == 1 ? "Die" : "Live" for val in df.outcome]
seed = 123
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=seed)
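
Before proceeding, it can be useful to check the sizes and class balance of the split. This check is illustrative and not part of the original walkthrough; note the "Die" class is a small minority in this dataset, which is why we validate using AUC below:

```julia
# Illustrative check: sizes of the train/test split and the class balance
# of the training target
length(train_y), length(test_y)
count(==("Die"), train_y) / length(train_y)
```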

Method 1. Complete cases

Subsetting the data

Complete-case analysis is a simple approach in which we drop all observations with any missing values before training:

train_X_cc = train_X[completecases(train_X), :]
train_y_cc = train_y[completecases(train_X)]
length(train_y_cc), length(train_y)
(55, 108)

Notice we are left with only 55 observations, compared to 108 before subsetting. We have lost half of the training set, so we may be discarding useful information, and the model's performance may suffer as a result.

Model Training

Now we train an optimal tree on the complete-case data:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=seed,
        criterion=:gini,
    ),
    max_depth=2:6,
)
IAI.fit!(grid, train_X_cc, train_y_cc, validation_criterion=:auc)
IAI.get_learner(grid)
Optimal Trees Visualization

We see that the resulting validated tree contains only a single split. After throwing away all observations with missing values, it seems there is not enough signal left in the data to train a meaningful tree.

Evaluate the Model

If we restrict the test set to complete cases as well, we will only be left with 25 observations. However, the other methods using imputation will make predictions on the entire test set of 47 observations. We would therefore not be able to compare the performance of the methods, because they would not be evaluated on the same set of samples. For this reason, we use mean imputation to fill in the test data, so that we still have the same 47 observations when evaluating the complete-case model:

imputer = IAI.ImputationLearner(method=:mean)
IAI.fit!(imputer, train_X)
test_X_imputed = IAI.transform(imputer, test_X)
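
As a quick sanity check (not part of the original walkthrough), we can verify that no missing values remain in the imputed test features; if the imputation succeeded, this should return true:

```julia
# Confirm that imputation has filled in every missing value in the test set
all(col -> !any(ismissing, col), eachcol(test_X_imputed))
```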

We save the evaluation results to a dataframe for comparison later:

results = DataFrame(
    method=:completecase,
    ins_acc=IAI.score(grid, train_X_cc, train_y_cc,
                      criterion=:misclassification),
    oos_acc=IAI.score(grid, test_X_imputed, test_y,
                      criterion=:misclassification),
    ins_auc=IAI.score(grid, train_X_cc, train_y_cc, criterion=:auc),
    oos_auc=IAI.score(grid, test_X_imputed, test_y, criterion=:auc),
)
1×5 DataFrames.DataFrame
│ Row │ method       │ ins_acc  │ oos_acc  │ ins_auc  │ oos_auc  │
│     │ Symbol       │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ completecase │ 0.909091 │ 0.702128 │ 0.821705 │ 0.482432 │

Methods 2-4: Using Imputation

Now we are going to explore various imputation methods provided by OptImpute. The pipeline we are going to use has the following steps:

  1. Train imputation learner and impute missing values on training set with fit_transform!
  2. Impute missing values on test set with transform
  3. Train predictive model on imputed training set
  4. Evaluate predictive model on imputed training and test sets
Info

When using imputation before training a predictive model, it is important to conduct the imputation using the process described in Steps 1 and 2 above. For more information, refer to the OptImpute documentation.

Let's now define a function implementing this pipeline that takes in a parameter specifying the imputation method to use:

function impute_then_predict(method)
  # Step 1: Train imputation learner and impute on training set
  imputer = IAI.ImputationLearner(
      method=method,
      cluster=true,
      cluster_max_size=100,
      random_seed=seed,
  )
  train_X_imputed = IAI.fit_transform!(imputer, train_X)

  # Step 2: Impute on test set
  test_X_imputed = IAI.transform(imputer, test_X)

  # Step 3: Train predictive model on imputed training set
  grid = IAI.GridSearch(
      IAI.OptimalTreeClassifier(
          random_seed=seed,
          criterion=:gini,
      ),
      max_depth=2:6,
  )
  IAI.fit!(grid, train_X_imputed, train_y, validation_criterion=:auc)

  # Step 4: Evaluate predictive model on imputed training and test sets
  append!(results, DataFrame(
      method=method,
      ins_acc=IAI.score(grid, train_X_imputed, train_y,
                        criterion=:misclassification),
      oos_acc=IAI.score(grid, test_X_imputed, test_y,
                        criterion=:misclassification),
      ins_auc=IAI.score(grid, train_X_imputed, train_y, criterion=:auc),
      oos_auc=IAI.score(grid, test_X_imputed, test_y, criterion=:auc),
  ))

  # Return predictive model
  IAI.get_learner(grid)
end

We will look at three imputation methods with this pipeline: random imputation, mean imputation, and optimal kNN imputation.

Method 2. Random imputation

First, we try imputing using a randomly selected value for each variable:

lnr_rand = impute_then_predict(:rand)
Optimal Trees Visualization

Method 3. Mean imputation

Next, we impute using the mean for continuous variables and the mode for categoric variables:

lnr_mean = impute_then_predict(:mean)
Optimal Trees Visualization