Online Imputation for Production Pipelines

Missing data is an unfortunate reality of real-world datasets, and it must be accounted for to achieve the best possible performance. Most attention, in both theory and practice, is given to handling missing data while training the model, but it is equally important to consider how missing data might affect the online predictions of a model in production.

One key aspect to consider is that data availability can be different in the online setting. Consider a healthcare setting where we aim to predict the risk of adverse events based on patient data such as demographics, comorbidities and lab values. Our training set may come from patients who visited a physician and had these variables measured as part of diagnosis, so the training data has very few missing values. If we train a model with high out-of-sample performance on this data, we may be confident that it will be useful for making predictions for new patients. However, if a user is accessing this tool from home, they may not know some values, and some features may be more likely to be missing than others (e.g. it is unlikely that a home user can provide values for blood test measurements).

Factors like this can create structural differences between the data used to train the model and the data it will face in production. It is critical to account for this while building models, since overlooking these differences can lead to unexpected, and potentially catastrophic, decreases in prediction quality.

In this case study, we use the Breast Cancer Wisconsin dataset to investigate the potential impact of missing data in the online setting, and illustrate some potential remedies.

First, we will load the dataset:

using CSV, DataFrames
df = CSV.read(
    "breast-cancer-wisconsin.data", DataFrame,
    missingstring="?",
    header=["Sample_Code_Number", "Clump_Thickness", "Uniformity_of_Cell_Size",
            "Uniformity_of_Cell_Shape", "Marginal_Adhesion",
            "Single_Epithelial_Cell_Size", "Bare_Nuclei", "Bland_Chromatin",
            "Normal_Nucleoli", "Mitoses", "Class"],
)
699×11 DataFrame
 Row │ Sample_Code_Number  Clump_Thickness  Uniformity_of_Cell_Size  Uniformit ⋯
     │ Int64               Int64            Int64                    Int64     ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │            1000025                5                        1            ⋯
   2 │            1002945                5                        4
   3 │            1015425                3                        1
   4 │            1016277                6                        8
   5 │            1017023                4                        1            ⋯
   6 │            1017122                8                       10
   7 │            1018099                1                        1
   8 │            1018561                2                        1
  ⋮  │         ⋮                  ⋮                    ⋮                       ⋱
 693 │             714039                3                        1            ⋯
 694 │             763235                3                        1
 695 │             776715                3                        1
 696 │             841769                2                        1
 697 │             888820                5                       10            ⋯
 698 │             897471                4                        8
 699 │             897471                4                        8
                                                  8 columns and 684 rows omitted

The goal is to predict whether a given breast tumor is benign or malignant based on a number of measurements relating to the tumor. The data was collected from clinical cases by a physician for research purposes, so it has been carefully curated and very few observations have missing values. We will remove those observations and then split the data into training and test datasets:

df = dropmissing(df)
X = df[:, 2:(end - 1)]
y = [val == 4 ? "Malignant" : "Benign" for val in df[:, end]]
seed = 2345
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      seed=seed)

Training and Evaluating the Model

We will train an OptimalTreeClassifier to predict the diagnosis of the tumor:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=seed,
        criterion=:gini,
    ),
    max_depth=2:6,
)
IAI.fit!(grid, train_X, train_y, validation_criterion=:auc)
IAI.get_learner(grid)
Optimal Trees Visualization

Next, we evaluate the performance of the model:

results = DataFrame()
# Score the model in- and out-of-sample on accuracy (via the misclassification
# criterion) and AUC, and append a row of results for the given method
function evaluate!(grid, method, train_X, train_y, test_X, test_y)
  append!(results, DataFrame(
      method=method,
      ins_acc=IAI.score(grid, train_X, train_y,
                        criterion=:misclassification),
      oos_acc=IAI.score(grid, test_X, test_y,
                        criterion=:misclassification),
      ins_auc=IAI.score(grid, train_X, train_y, criterion=:auc),
      oos_auc=IAI.score(grid, test_X, test_y, criterion=:auc),
  ))
end
evaluate!(grid, :original, train_X, train_y, test_X, test_y)
1×5 DataFrame
 Row │ method    ins_acc   oos_acc   ins_auc   oos_auc
     │ Symbol    Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────
   1 │ original  0.966527  0.970732  0.978166  0.980838

We see that we have found a simple tree model that has very strong out-of-sample performance in terms of accuracy and AUC. We might therefore conclude that this model is a good one and start using it to make predictions.
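For example, generating predictions for new observations would then be a single call. Here is a quick sketch using the held-out test set as a stand-in for new data, assuming the standard IAI prediction functions:

IAI.predict(grid, test_X)        # predicted class labels
IAI.predict_proba(grid, test_X)  # predicted class probabilities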

What if there are missing values for predictions?

As discussed earlier, the input data for an online prediction may often be incomplete. The user may need a quick answer before all of the data becomes available, or some values may simply be unobtainable for practical reasons. Worse still, the features that are missing are sometimes the most important ones for the model.

In our case, we will consider a scenario where the Uniformity_of_Cell_Size variable is missing for new datapoints when making online predictions (e.g. perhaps this measurement is more expensive or takes longer to acquire).

test_X_missing = allowmissing(test_X)
test_X_missing.Uniformity_of_Cell_Size[:] .= missing

As it stands, our model can no longer make predictions on this data directly, as this now-missing variable is the first split in the tree.

IAI.predict(grid, test_X_missing)
ERROR: TaskFailedException

    nested task error: ArgumentError: Don't know how to handle missing value
[...]

In order to make predictions on this data, we will need to explicitly deal with the missing values.

Online Mean Imputation

Our first inclination may be to reach for an imputation package to fill in the missing values so that we can make a prediction.

The default choice for imputation is often simple imputation, where the mean (or mode) of each feature in the training data is used to fill in the missing values in the new data. Indeed, this is the default strategy of scikit-learn's SimpleImputer, so it is widely used in practice. Let's try this on our data:

imputer = IAI.ImputationLearner(:mean)
IAI.fit!(imputer, train_X)
test_X_mean_imputed = IAI.transform(imputer, test_X_missing)
205×9 DataFrame
 Row │ Clump_Thickness  Uniformity_of_Cell_Size  Uniformity_of_Cell_Shape  Mar ⋯
     │ Float64?         Float64?                 Float64?                  Flo ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │             5.0                  3.16318                       1.0      ⋯
   2 │             4.0                  3.16318                       1.0
   3 │             8.0                  3.16318                      10.0
   4 │             4.0                  3.16318                       1.0
   5 │             4.0                  3.16318                       1.0      ⋯
   6 │             6.0                  3.16318                       1.0
   7 │             5.0                  3.16318                       3.0
   8 │             5.0                  3.16318                       1.0
  ⋮  │        ⋮                    ⋮                        ⋮                  ⋱
 199 │             2.0                  3.16318                       1.0      ⋯
 200 │             1.0                  3.16318                       1.0
 201 │             1.0                  3.16318                       1.0
 202 │             1.0                  3.16318                       1.0
 203 │             4.0                  3.16318                       1.0      ⋯
 204 │             2.0                  3.16318                       1.0
 205 │             5.0                  3.16318                      10.0
                                                  6 columns and 190 rows omitted

And look at the performance:

evaluate!(grid, :mean, train_X, train_y, test_X_mean_imputed, test_y)
2×5 DataFrame
 Row │ method    ins_acc   oos_acc   ins_auc   oos_auc
     │ Symbol    Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────
   1 │ original  0.966527  0.970732  0.978166  0.980838
   2 │ mean      0.966527  0.946341  0.978166  0.943818

The performance is significantly worse than that of the original model. This is unsurprising: the value of the important variable Uniformity_of_Cell_Size is fixed at a constant for every observation. Looking at the decision tree, this sends all observations to the left at Node 2 (even though some should go right), making the predictions imprecise.
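We can sanity-check this with a minimal sketch, assuming IAI.apply, which returns the index of the leaf reached by each observation. With the key variable held constant, the imputed test set should only reach leaves on one side of that split:

lnr = IAI.get_learner(grid)
# Leaf indices reached by the mean-imputed test set; with Uniformity_of_Cell_Size
# fixed at its training mean, only one side of the first split is reachable
unique(IAI.apply(lnr, test_X_mean_imputed))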

Online Optimal Imputation

The key issue with mean imputation is that it does not incorporate signal from other variables that are correlated with the important variable. This motivates the use of more sophisticated imputation methods such as Optimal Imputation, which aim to better capture the relationships between features and give more reasonable imputed values. Let's try imputing with an Optimal KNN model:

imputer = IAI.ImputationLearner(:opt_knn)
IAI.fit!(imputer, train_X)
test_X_opt_knn_imputed = IAI.transform(imputer, test_X_missing)
205×9 DataFrame
 Row │ Clump_Thickness  Uniformity_of_Cell_Size  Uniformity_of_Cell_Shape  Mar ⋯
     │ Float64?         Float64?                 Float64?                  Flo ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │             5.0                      1.1                       1.0      ⋯
   2 │             4.0                      1.1                       1.0
   3 │             8.0                      9.6                      10.0
   4 │             4.0                      1.0                       1.0
   5 │             4.0                      1.0                       1.0      ⋯
   6 │             6.0                      1.1                       1.0
   7 │             5.0                      3.0                       3.0
   8 │             5.0                      1.0                       1.0
  ⋮  │        ⋮                    ⋮                        ⋮                  ⋱
 199 │             2.0                      1.0                       1.0      ⋯
 200 │             1.0                      1.0                       1.0
 201 │             1.0                      1.0                       1.0
 202 │             1.0                      1.0                       1.0
 203 │             4.0                      1.1                       1.0      ⋯
 204 │             2.0                      1.0                       1.0
 205 │             5.0                      8.3                      10.0
                                                  6 columns and 190 rows omitted

And if we look at the performance:

evaluate!(grid, :opt_knn, train_X, train_y, test_X_opt_knn_imputed, test_y)
3×5 DataFrame
 Row │ method    ins_acc   oos_acc   ins_auc   oos_auc
     │ Symbol    Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────
   1 │ original  0.966527  0.970732  0.978166  0.980838
   2 │ mean      0.966527  0.946341  0.978166  0.943818
   3 │ opt_knn   0.966527  0.97561   0.978166  0.9882

The results are significantly better than with mean imputation, and comparable to the results on the original data.
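In a production pipeline, this corresponds to keeping the fitted imputer alongside the model and chaining the two at prediction time, along these lines:

# Impute any missing fields in the incoming data, then predict with the original model
IAI.predict(grid, IAI.transform(imputer, test_X_missing))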

Simulate missing data in training

Some models can handle missing data directly, but if the training data contains no missing values, the model cannot learn the optimal treatment of missing data. As an example, Optimal Trees can support missing data via "separate class" mode, where the trees will learn the best direction in which to send missing values at each split. However, if the training data is complete, it cannot learn the best direction and will instead just use a default direction for all points, which is no different from mean imputation. To remedy this and train a model that knows how to handle missing data at prediction time, we can introduce missingness into the training data.

In our case, we will try this by removing approximately 2% of the training values of Uniformity_of_Cell_Size:

using StableRNGs
rng = StableRNG(seed)
missing_inds = rand(rng, size(train_X, 1)) .< 0.02
train_X_partial = allowmissing(train_X)
train_X_partial.Uniformity_of_Cell_Size[missing_inds] .= missing
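We can quickly check how many values were actually removed (a simple count in base Julia):

# Number of simulated missing values in the training data
count(ismissing, train_X_partial.Uniformity_of_Cell_Size)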

We can now train an Optimal Classification Tree with direct missing data support by setting missingdatamode to :separate_class:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=seed,
        criterion=:gini,
        missingdatamode=:separate_class,
    ),
    max_depth=2:6,
)
IAI.fit!(grid, train_X_partial, train_y, validation_criterion=:auc)
IAI.get_learner(grid)
Optimal Trees Visualization

Note that the variable to which we have introduced missingness is now at the top of the tree, with a different threshold. We can also see by the semi-circle on top of Node 3 that missing values are sent to the right-hand side of this split.
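Unlike the original model, this learner can make predictions on the incomplete test set directly, with no imputation step:

IAI.predict(grid, test_X_missing)

We then evaluate its performance in the same way as before: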

evaluate!(grid, :direct_model, train_X, train_y, test_X_missing, test_y)
4×5 DataFrame
 Row │ method        ins_acc   oos_acc   ins_auc   oos_auc
     │ Symbol        Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────
   1 │ original      0.966527  0.970732  0.978166  0.980838
   2 │ mean          0.966527  0.946341  0.978166  0.943818
   3 │ opt_knn       0.966527  0.97561   0.978166  0.9882
   4 │ direct_model  0.979079  0.95122   0.98451   0.961571

The results suggest that the performance of this approach is much closer to that on the original dataset, and significantly better than simple mean imputation.

Train a simpler reduced model

Suppose we know ahead of time that one or more features may not be readily available for predictions in production. In these cases, we can train an alternative model that simply does not use these features, ensuring it can always be applied. There is also no need to use the same model for all predictions: if the original model is significantly stronger, we might fall back to it on a case-by-case basis whenever full data is available for a given prediction.

To illustrate this, we will train a variant of our model by removing the potentially-missing feature from our training set entirely:

train_X_reduced = select(train_X, Not(:Uniformity_of_Cell_Size))
grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed=seed,
        criterion=:gini,
    ),
    max_depth=2:6,
)
IAI.fit!(grid, train_X_reduced, train_y, validation_criterion=:auc)
IAI.get_learner(grid)
Optimal Trees Visualization

And then examine the performance:

test_X_reduced = select(test_X, Not(:Uniformity_of_Cell_Size))
evaluate!(grid, :reduced_model, train_X_reduced, train_y, test_X_reduced,
          test_y)
5×5 DataFrame
 Row │ method         ins_acc   oos_acc   ins_auc   oos_auc
     │ Symbol         Float64   Float64   Float64   Float64
─────┼───────────────────────────────────────────────────────
   1 │ original       0.966527  0.970732  0.978166  0.980838
   2 │ mean           0.966527  0.946341  0.978166  0.943818
   3 │ opt_knn        0.966527  0.97561   0.978166  0.9882
   4 │ direct_model   0.979079  0.95122   0.98451   0.961571
   5 │ reduced_model  0.947699  0.960976  0.960539  0.963868

This model trained on the reduced data also performs much closer to the original model than mean imputation does.

To explain this, recall from the strong performance of Optimal Imputation that the important feature Uniformity_of_Cell_Size can be imputed well from the other features, and the original model then uses this imputed value to make its predictions. The reduced-model approach instead predicts the outcome directly from these other features, bypassing the intermediate imputation step.
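As noted earlier, we also need not commit to a single model at deployment time. Below is a minimal sketch of the case-by-case approach, using hypothetical names grid_original and grid_reduced for the two learners trained above, which dispatches each incoming observation to the appropriate model:

# Use the original model when the observation is complete, and fall back to
# the reduced model (which does not use Uniformity_of_Cell_Size) otherwise
function predict_online(grid_original, grid_reduced, X)
    map(eachrow(X)) do row
        if ismissing(row.Uniformity_of_Cell_Size)
            only(IAI.predict(grid_reduced,
                             DataFrame(row[Not(:Uniformity_of_Cell_Size)])))
        else
            only(IAI.predict(grid_original, DataFrame(row)))
        end
    end
end

predict_online(grid_original, grid_reduced, test_X_missing)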

Conclusions

In this case study, we saw that changing circumstances between a model's training and its eventual use can have a significant detrimental effect on performance. The key takeaways are:

  1. Data scientists need to think carefully about how a model will be used and address this during model construction. In particular, factors such as whether variables will be available at model deployment must be accounted for when evaluating and selecting models in order to get a realistic measurement of out-of-sample performance.
  2. There are many ways to handle unknown values in prediction data. However, rudimentary, off-the-shelf approaches such as mean imputation can significantly weaken model performance. More sophisticated methods like Optimal Imputation can achieve performance similar to that on the original data.
  3. Introducing missingness into the training data to simulate the data scenario expected in production allows us to train a model that handles missing values appropriately, and this can achieve strong performance.
  4. Training a model only on the features we expect to be fully present at prediction time is a simple yet strong solution. By avoiding an intermediate imputation step, it may also be easier to explain the model and build trust.