Interpretable Clustering of Credit Card Data

In this example, we showcase how Optimal Trees can be used to solve clustering problems in an interpretable way. Clustering is an unsupervised learning method that aims to segment a set of observations into similar groups based on the available data. Traditional clustering methods, such as k-means, assign points to clusters based on a distance metric, without providing any description of the clusters or the reasons why points are assigned to them. We demonstrate two approaches that use Optimal Trees to identify meaningful clusters in an interpretable way.

  • The first method, Interpretable K-Means Clustering, developed in Chapter 20 of Machine Learning Under a Modern Optimization Lens (2019), interprets the results of a traditional clustering model, such as k-means, by fitting a multi-class Optimal Classification Tree to predict the cluster assigned to each sample by the clustering model. The resulting tree provides, without any manual intervention, both an interpretation of each cluster and insight into how points are assigned to the clusters.
  • The second method, Guided Clustering, can be used when we know in advance a specific measure by which we would like to cluster our samples. For instance, when clustering credit card holders based on usage activity, clusters aligned with customer demographics would be more useful in a marketing application, while clusters aligned with credit responsibility would be more useful in a risk management application. Depending on the desired application, we can run our clustering process guided by this measure by reframing the problem as a traditional supervised prediction problem. In the resulting tree model, each leaf contains a bucket of samples that behave similarly under the selected measure, providing directly interpretable clusters.

Throughout this example, we will use a credit card dataset summarizing the usage behavior of roughly 9,000 credit card holders across 17 features. This dataset is openly available on Kaggle.

Data Preparation

We start off by importing the dataset.

using CSV, DataFrames
# Read the raw data and drop the customer ID column, which carries no behavioral information
df = CSV.read("credit_card_kaggle.csv", DataFrame)
select!(df, Not(:CUST_ID))
8950×17 DataFrame
  Row │ BALANCE     BALANCE_FREQUENCY  PURCHASES  ONEOFF_PURCHASES  INSTALLMEN ⋯
      │ Float64     Float64            Float64    Float64           Float64    ⋯
──────┼─────────────────────────────────────────────────────────────────────────
    1 │   40.9007            0.818182      95.4               0.0              ⋯
    2 │ 3202.47              0.909091       0.0               0.0
    3 │ 2495.15              1.0          773.17            773.17
    4 │ 1666.67              0.636364    1499.0            1499.0
    5 │  817.714             1.0           16.0              16.0              ⋯
    6 │ 1809.83              1.0         1333.28              0.0
    7 │  627.261             1.0         7091.01           6402.63
    8 │ 1823.65              1.0          436.2               0.0
  ⋮   │     ⋮               ⋮              ⋮             ⋮                     ⋱
 8944 │    5.87171           0.5           20.9              20.9              ⋯
 8945 │  193.572             0.833333    1012.73           1012.73
 8946 │   28.4935            1.0          291.12              0.0
 8947 │   19.1832            1.0          300.0               0.0
 8948 │   23.3987            0.833333     144.4               0.0              ⋯
 8949 │   13.4576            0.833333       0.0               0.0
 8950 │  372.708             0.666667    1093.25           1093.25
                                                13 columns and 8935 rows omitted

The data has some missing values, which we impute using OptImpute:

imputer = IAI.ImputationLearner(method=:opt_knn, random_seed=1)
df = IAI.fit_transform!(imputer, df)

Once the missing data is imputed, we scale each column to have zero mean and unit variance so that the clustering is not affected by differences in scale across columns.

using Statistics
X = Matrix(df)
X_norm = (X .- mean(X, dims=1)) ./ std(X, dims=1)

Interpretable K-Means Clustering

This first method consists of two steps:

  • applying a standard clustering algorithm, in this case k-means, to our dataset, in order to output a vector assigning each sample to a cluster number;
  • fitting an Optimal Classification Tree to predict this vector, and analyzing the resulting model.

K-Means Clustering

For the purposes of this case, we will pick $k = 4$ clusters. We first apply the k-means algorithm to our scaled dataset, and compute the cluster assignment as a vector y.

using StableRNGs, ParallelKMeans

rng = StableRNG(1)
k = 4
clustering_result = kmeans(X_norm', k; rng=rng)
y = clustering_result.assignments
8950-element Vector{Int64}:
 3
 1
 4
 3
 3
 4
 2
 4
 3
 3
 ⋮
 1
 4
 3
 3
 4
 4
 4
 3
 3

The clustering defined by k-means is difficult to interpret. Each cluster is represented by its centroid in 17 dimensions (one for each feature), and each point is assigned to the cluster with the closest centroid. Therefore, in order to understand why a point was assigned to a given cluster, we also have to understand why it was not assigned to any other cluster, which requires making sense of the 17-dimensional centroids.
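To make this concrete, here is a small sketch (using only the objects defined above) that recomputes the nearest-centroid rule by hand; explaining any single assignment means comparing a point against all four of these 17-dimensional centroids.

using LinearAlgebra

# Recompute the k-means rule by hand: each point is assigned to its nearest centroid
nearest = [argmin([norm(X_norm[i, :] .- clustering_result.centers[:, j]) for j in 1:k])
           for i in 1:size(X_norm, 1)]

# This should agree with (or be very close to) the assignments returned by kmeans
mean(nearest .== y)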

Option 1: Optimal Classification Tree

We can use Optimal Trees to understand which features drive the division of samples into different clusters. To do this, we train an Optimal Classification Tree with the cluster assignments given by the k-means algorithm as the target variable:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(random_seed=1, minbucket=50),
    max_depth=2:4,
)
IAI.fit!(grid, df, y)
[Optimal Classification Tree visualization]

We can now look at the tree to understand how points are assigned to clusters. Note that the intensity of the color of the leaf (e.g. dark red vs. light red) represents the proportion of points in each leaf that have the predicted class.

The tree first splits on PURCHASES_FREQUENCY: customers with lower frequencies (left side of the tree) tend to fall in clusters 1 and 3, while those with higher frequencies (right side of the tree) tend to fall in clusters 1, 2, and 4.

When the purchase frequency is lower, the tree uses BALANCE, CASH_ADVANCE, and CASH_ADVANCE_TRX to determine whether the customer is in cluster 1 or 3. For example, those with lower balances and cash advances tend to be in cluster 3.

Similarly, when the purchase frequency is higher, CASH_ADVANCE, CASH_ADVANCE_TRX, and PURCHASES are used to determine the cluster. For example, cluster 2 tends to contain the customers with higher purchase amounts.
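Before summarizing the clusters, we can also rank the features the tree relies on. The following is a brief sketch assuming the standard IAI variable importance API:

# Extract the best tree found by the grid search and rank its features
# by how much they contribute to the splits
lnr = IAI.get_learner(grid)
IAI.variable_importance(lnr)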

A brief analysis of this relatively simple tree thus allows us to draw the following conclusions on our four clusters:

  • Cluster 1 aggregates heavy but infrequent users: despite not purchasing frequently, they maintain high balances and cash advances.
  • Cluster 2 aggregates heavy users: they make purchases relatively frequently, leading to higher purchase amounts.
  • Cluster 3 aggregates light and cautious users: they make purchases relatively infrequently and have low balances and low cash advances.
  • Cluster 4 aggregates light but frequent users: despite making frequent purchases, they have low purchase amounts and few cash advances.

We can calculate the accuracy of this tree in predicting the clusters.

IAI.score(grid, df, y, criterion=:misclassification)
0.9249162011173184

With a simple application of Optimal Trees, we were thus able to obtain both precise insights into our four clusters and the decision paths that lead a sample to each of them. Given its high accuracy, this tree could now be used as a clustering model in its own right, serving as an interpretable version of our previous k-means model.
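For instance, assigning samples to clusters then amounts to a single prediction call (a brief sketch assuming the usual IAI prediction API; here we re-apply the tree to the training data, but the same call would work for new customers):

# Use the fitted tree itself as an interpretable clustering model:
# each sample receives the cluster label predicted by its decision path
tree_clusters = IAI.predict(grid, df)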

Option 2: Optimal Policy Tree

Instead of treating the cluster assignment as a labeled classification problem, another approach would be to give the model the relative distance to each of the cluster centroids, and use Optimal Policy Trees to assign points to clusters in order to minimize the average distance to the assigned cluster.

Let's first calculate the distance to each centroid and use this as the reward matrix.

n = size(df, 1)
rewards = zeros(n, k)
for i in 1:n
  for j in 1:k
    # Euclidean distance from sample i to the centroid of cluster j
    centroid = clustering_result.centers[:, j]
    rewards[i, j] = sqrt(sum(abs2, X_norm[i, :] .- centroid))
  end
end
rewards = DataFrame(rewards, Symbol.(1:k))
8950×4 DataFrame
  Row │ 1        2         3        4
      │ Float64  Float64   Float64  Float64
──────┼─────────────────────────────────────
    1 │ 5.05604   9.45417  1.24565  3.34653
    2 │ 2.63242   9.17259  3.58534  4.9135
    3 │ 5.40308   7.90728  4.16504  3.44633
    4 │ 4.38792   8.57097  1.83591  3.6516
    5 │ 4.82516   9.37676  1.42728  3.43788
    6 │ 4.85841   7.93035  2.81228  2.2223
    7 │ 8.83973   4.61903  8.49642  6.57509
    8 │ 5.27382   8.38233  3.41028  1.99441
  ⋮   │    ⋮        ⋮         ⋮        ⋮
 8944 │ 6.76995  10.6567   4.43031  5.69278
 8945 │ 6.28939   9.69875  4.31911  5.08049
 8946 │ 7.02789   9.88848  5.32942  4.69652
 8947 │ 6.83982   9.90472  5.13935  4.71792
 8948 │ 6.73615  10.0087   4.79318  4.673
 8949 │ 6.23375  10.6477   4.28184  5.59323
 8950 │ 6.32234   9.46181  5.00377  5.3574
                           8935 rows omitted

Now we can train an Optimal Policy Tree to minimize this distance.

policy_grid = IAI.GridSearch(
    IAI.OptimalTreePolicyMinimizer(random_seed=1, minbucket=50),
    max_depth=2:4,
)
IAI.fit!(policy_grid, df, rewards)
[Optimal Policy Tree visualization]
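As a rough sanity check (a sketch assuming the usual IAI policy-learner API, where predict returns the name of the prescribed reward column), we could compare the average centroid distance under the policy tree's assignments against the original k-means assignment:

# Cluster prescribed by the policy tree for each sample
prescribed = IAI.predict(policy_grid, df)

# Average distance to the assigned centroid under the tree vs. under k-means
tree_dist = mean(rewards[i, Symbol(prescribed[i])] for i in 1:n)
kmeans_dist = mean(rewards[i, y[i]] for i in 1:n)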