Online Imputation for Production Pipelines

Missing data is an unfortunate reality of real-world datasets, and it must be accounted for to achieve the best performance. Most attention, in theory and in practice, is given to handling missing data while training the model, but it is equally important to consider how missing data might affect the online predictions of a model in production.

One key aspect to consider is that data availability can be different in the online setting. Consider a healthcare setting where we aim to predict the risk of adverse events based on patient data such as demographics, comorbidities and lab values. Our training set may come from patients that have visited a physician and had these variables measured as part of diagnosis, thus the training data has very few missing values. If we train a model with high out-of-sample performance on this data, we may have confidence that this will be useful for making predictions for new patients. However, if a user is accessing this tool from home, they may not know some values, and some features may be more likely to be missing than others (e.g. it may be unlikely that a home user can provide values for blood test measurements).

Factors like this can create structural differences between the data used to train the model and the data it will face in production. It is critical to consider this as we build models, as overlooking these differences can lead to unexpected, and potentially catastrophic, decreases in prediction quality.

In this case study, we use the Breast Cancer Wisconsin dataset as the context to investigate the potential impact of missing data in the online setting, and illustrate some potential remedies.

First, we will load the dataset:

using CSV, DataFrames
df = DataFrame(CSV.File(
    "breast-cancer-wisconsin.data",
    missingstring="?",
    header=["Sample_Code_Number", "Clump_Thickness", "Uniformity_of_Cell_Size",
            "Uniformity_of_Cell_Shape", "Marginal_Adhesion",
            "Single_Epithelial_Cell_Size", "Bare_Nuclei", "Bland_Chromatin",
            "Normal_Nucleoli", "Mitoses", "Class"],
))
699×11 DataFrame. Omitted printing of 8 columns
│ Row │ Sample_Code_Number │ Clump_Thickness │ Uniformity_of_Cell_Size │
│     │ Int64              │ Int64           │ Int64                   │
├─────┼────────────────────┼─────────────────┼─────────────────────────┤
│ 1   │ 1000025            │ 5               │ 1                       │
│ 2   │ 1002945            │ 5               │ 4                       │
│ 3   │ 1015425            │ 3               │ 1                       │
│ 4   │ 1016277            │ 6               │ 8                       │
│ 5   │ 1017023            │ 4               │ 1                       │
│ 6   │ 1017122            │ 8               │ 10                      │
│ 7   │ 1018099            │ 1               │ 1                       │
⋮
│ 692 │ 695091             │ 5               │ 10                      │
│ 693 │ 714039             │ 3               │ 1                       │
│ 694 │ 763235             │ 3               │ 1                       │
│ 695 │ 776715             │ 3               │ 1                       │
│ 696 │ 841769             │ 2               │ 1                       │
│ 697 │ 888820             │ 5               │ 10                      │
│ 698 │ 897471             │ 4               │ 8                       │
│ 699 │ 897471             │ 4               │ 8                       │

The goal is to predict whether a given breast tumor is benign or malignant based on a number of measurements relating to the tumor. The data was collected from clinical cases by a physician for research purposes, and thus it has been carefully curated and there are very few observations that have missing values. We will remove those observations and then split the data into training and test datasets:

df = dropmissing(df)
X = df[:, 2:(end - 1)]
y = [val == 4 ? "Malignant" : "Benign" for val in df[:, end]]
seed = 1
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=seed)

Training and Evaluating the Model

We will train an OptimalTreeClassifier to predict the diagnosis of the tumor:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=seed,
        criterion=:gini,
    ),
    max_depth=2:6,
)
IAI.fit!(grid, train_X, train_y, validation_criterion=:auc)
IAI.get_learner(grid)
Optimal Trees Visualization

Next, we evaluate the performance of the model:

results = DataFrame()
function evaluate!(grid, method, train_X, train_y, test_X, test_y)
  append!(results, DataFrame(
      method=method,
      ins_acc=IAI.score(grid, train_X, train_y,
                        criterion=:misclassification),
      oos_acc=IAI.score(grid, test_X, test_y,
                        criterion=:misclassification),
      ins_auc=IAI.score(grid, train_X, train_y, criterion=:auc),
      oos_auc=IAI.score(grid, test_X, test_y, criterion=:auc),
  ))
end
evaluate!(grid, :original, train_X, train_y, test_X, test_y)
1×5 DataFrame
│ Row │ method   │ ins_acc  │ oos_acc  │ ins_auc  │ oos_auc  │
│     │ Symbol   │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ original │ 0.949791 │ 0.921951 │ 0.979177 │ 0.972744 │

We see that we have found a simple tree model that has very strong out-of-sample performance in terms of accuracy and AUC. We might therefore conclude that this model is a good one and start using it to make predictions.

What if there are missing values for predictions?

As discussed earlier, the input data for an online prediction may often be incomplete. The user may need a quick answer before all the data become available, or some data may simply be unattainable for practical reasons. Worse still, the missing features are sometimes the most important ones for the model.

In our case, we will consider a scenario where the Uniformity_of_Cell_Size variable is missing for new datapoints when making online predictions (e.g. perhaps this measurement is more expensive or takes longer to acquire).

test_X_missing = allowmissing(test_X)
test_X_missing.Uniformity_of_Cell_Size .= missing

As it stands, our model can no longer make predictions on this data directly, as this now-missing variable is the first split in the tree.

IAI.predict(grid, test_X_missing)
ERROR: Don't know how to handle missing value
[...]

In order to make predictions on this data, we will need to explicitly deal with the missing values.

Online Mean Imputation

Our first inclination may be to reach for an imputation package to fill in the missing values so that we can make a prediction.

The default choice for imputation is often a simple imputation where the mean/mode from the training data is used to impute each observation of the new data. Indeed, this is the default missing value strategy in scikit-learn's SimpleImputer, so it is widely used in practice. Let's try this on our data:
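To make clear what this strategy does under the hood, here is an illustrative sketch (not the IAI implementation) of mean imputation using plain DataFrames and Statistics: the column means are learned once from the training data, then reused to fill missing entries of any new observations.

```julia
using DataFrames, Statistics

# Learn the mean of each column from the training data.
function fit_means(train::DataFrame)
    Dict(name => mean(skipmissing(train[!, name])) for name in names(train))
end

# Fill missing entries of new data with the stored training means.
function mean_impute(means::Dict, X::DataFrame)
    out = copy(X)
    for name in names(out)
        out[!, name] = coalesce.(out[!, name], means[name])
    end
    out
end
```

Note that the means must come from the training data: recomputing them on the incoming online data would leak information and, for a fully-missing column, would be impossible.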

imputer = IAI.ImputationLearner(:mean)
IAI.fit!(imputer, train_X)
test_X_mean_imputed = IAI.transform(imputer, test_X_missing)
205×9 DataFrame. Omitted printing of 6 columns
│ Row │ Clump_Thickness │ Uniformity_of_Cell_Size │ Uniformity_of_Cell_Shape │
│     │ Float64?        │ Union{Missing, Float64} │ Union{Missing, Float64}  │
├─────┼─────────────────┼─────────────────────────┼──────────────────────────┤
│ 1   │ 5.0             │ 3.10251                 │ 4.0                      │
│ 2   │ 4.0             │ 3.10251                 │ 1.0                      │
│ 3   │ 8.0             │ 3.10251                 │ 10.0                     │
│ 4   │ 2.0             │ 3.10251                 │ 1.0                      │
│ 5   │ 1.0             │ 3.10251                 │ 1.0                      │
│ 6   │ 7.0             │ 3.10251                 │ 6.0                      │
│ 7   │ 6.0             │ 3.10251                 │ 1.0                      │
⋮
│ 198 │ 5.0             │ 3.10251                 │ 1.0                      │
│ 199 │ 1.0             │ 3.10251                 │ 1.0                      │
│ 200 │ 4.0             │ 3.10251                 │ 1.0                      │
│ 201 │ 1.0             │ 3.10251                 │ 1.0                      │
│ 202 │ 3.0             │ 3.10251                 │ 1.0                      │
│ 203 │ 3.0             │ 3.10251                 │ 1.0                      │
│ 204 │ 4.0             │ 3.10251                 │ 6.0                      │
│ 205 │ 4.0             │ 3.10251                 │ 8.0                      │

And look at the performance:

evaluate!(grid, :mean, train_X, train_y, test_X_mean_imputed, test_y)
2×5 DataFrame
│ Row │ method   │ ins_acc  │ oos_acc  │ ins_auc  │ oos_auc  │
│     │ Symbol   │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ original │ 0.949791 │ 0.921951 │ 0.979177 │ 0.972744 │
│ 2   │ mean     │ 0.949791 │ 0.907317 │ 0.979177 │ 0.912646 │

The performance is significantly worse than the original model's. This is unsurprising: the value of the important variable Uniformity_of_Cell_Size is fixed at a constant, so at the first split the tree branches every observation to the left node (even though some should go right), making the predictions imprecise.

Online Optimal Imputation

The key issue with mean imputation is that it does not incorporate signal from other variables that are correlated with the important variable. This motivates the use of more sophisticated imputation methods such as Optimal Imputation, which aim to better capture the relationships between features and give more reasonable imputed values. Let's try imputing with an Optimal KNN model:

imputer = IAI.ImputationLearner(:opt_knn)
IAI.fit!(imputer, train_X)
test_X_opt_knn_imputed = IAI.transform(imputer, test_X_missing)
205×9 DataFrame. Omitted printing of 6 columns
│ Row │ Clump_Thickness │ Uniformity_of_Cell_Size │ Uniformity_of_Cell_Shape │
│     │ Float64?        │ Union{Missing, Float64} │ Union{Missing, Float64}  │
├─────┼─────────────────┼─────────────────────────┼──────────────────────────┤
│ 1   │ 5.0             │ 3.7                     │ 4.0                      │
│ 2   │ 4.0             │ 1.0                     │ 1.0                      │
│ 3   │ 8.0             │ 9.3                     │ 10.0                     │
│ 4   │ 2.0             │ 1.0                     │ 1.0                      │
│ 5   │ 1.0             │ 1.0                     │ 1.0                      │
│ 6   │ 7.0             │ 3.4                     │ 6.0                      │
│ 7   │ 6.0             │ 1.1                     │ 1.0                      │
⋮
│ 198 │ 5.0             │ 1.0                     │ 1.0                      │
│ 199 │ 1.0             │ 1.0                     │ 1.0                      │
│ 200 │ 4.0             │ 1.0                     │ 1.0                      │
│ 201 │ 1.0             │ 1.1                     │ 1.0                      │
│ 202 │ 3.0             │ 1.0                     │ 1.0                      │
│ 203 │ 3.0             │ 1.0                     │ 1.0                      │
│ 204 │ 4.0             │ 6.0                     │ 6.0                      │
│ 205 │ 4.0             │ 7.9                     │ 8.0                      │

And if we look at the performance:

evaluate!(grid, :opt_knn, train_X, train_y, test_X_opt_knn_imputed, test_y)
3×5 DataFrame
│ Row │ method   │ ins_acc  │ oos_acc  │ ins_auc  │ oos_auc  │
│     │ Symbol   │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ original │ 0.949791 │ 0.921951 │ 0.979177 │ 0.972744 │
│ 2   │ mean     │ 0.949791 │ 0.907317 │ 0.979177 │ 0.912646 │
│ 3   │ opt_knn  │ 0.949791 │ 0.926829 │ 0.979177 │ 0.979741 │

The results are significantly better than mean imputation, and comparable to the results on the original data.

Simulate missing data in training

Some models can handle missing data directly, but if the training data does not have any missing values, the model will not be able to learn the optimal treatment of missing data. As an example, Optimal Trees can support missing data via "separate class" mode, where the trees will learn the best direction for missing data at each split. However, if the training data is complete, it cannot learn the best direction and instead will just use a default direction for all points, which is no different from mean imputation. To remedy this and train a model that knows how to handle missing data in production, we can introduce missingness into the training data.

In our case, we will try this by removing approximately 15% of the values for this variable:

import Random
Random.seed!(seed)
missing_inds = rand(size(train_X, 1)) .< 0.15
train_X_partial = allowmissing(train_X)
train_X_partial.Uniformity_of_Cell_Size[missing_inds] .= missing

We can now train an Optimal Classification Tree with direct missing data support by setting missingdatamode to :separate_class:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=seed,
        criterion=:gini,
        missingdatamode=:separate_class,
    ),
    max_depth=2:6,
)
IAI.fit!(grid, train_X_partial, train_y, validation_criterion=:auc)
IAI.get_learner(grid)
Optimal Trees Visualization

Note that the variable to which we have introduced missingness is still in the tree at the same threshold, but at a different location in the tree. We can also see by the semi-circle on top of Node 4 that missing values are sent to the left-hand side of this split.

evaluate!(grid, :direct_model, train_X, train_y, test_X_missing, test_y)
4×5 DataFrame
│ Row │ method       │ ins_acc  │ oos_acc  │ ins_auc  │ oos_auc  │
│     │ Symbol       │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ original     │ 0.949791 │ 0.921951 │ 0.979177 │ 0.972744 │
│ 2   │ mean         │ 0.949791 │ 0.907317 │ 0.979177 │ 0.912646 │
│ 3   │ opt_knn      │ 0.949791 │ 0.926829 │ 0.979177 │ 0.979741 │
│ 4   │ direct_model │ 0.98954  │ 0.946341 │ 0.997083 │ 0.952433 │

The results suggest that the performance of this approach is again comparable to that on the original dataset, and significantly better than simple mean imputation.

Train a simpler reduced model

Suppose we know ahead of time that one or more features may not be readily available for predictions in production. In these cases, we can train an alternative model that simply does not use these features, ensuring it can always be applied. There is also no need to use the same model for all predictions: if the original model is significantly stronger, we might use it on a case-by-case basis whenever full data is available for a given prediction.

To illustrate this, we will train a variant of our model by removing the potentially-missing feature from our training set entirely:

train_X_reduced = select(train_X, Not(:Uniformity_of_Cell_Size))
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=seed,
        criterion=:gini,
    ),
    max_depth=2:6,
)
IAI.fit!(grid, train_X_reduced, train_y, validation_criterion=:auc)
IAI.get_learner(grid)
Optimal Trees Visualization
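To compare this reduced model against the other approaches, we can drop the same column from the test set and reuse our evaluation helper (a sketch continuing the session above; the resulting scores will depend on the fitted tree):

```julia
# Evaluate the reduced model on data without Uniformity_of_Cell_Size, so it
# can be applied regardless of whether that feature is available online.
test_X_reduced = select(test_X, Not(:Uniformity_of_Cell_Size))
evaluate!(grid, :reduced, train_X_reduced, train_y, test_X_reduced, test_y)
```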