One common issue with comparing outcomes between groups is that the groups may have very different characteristics, making a direct statistical comparison invalid or inappropriate. For example, if the goal is to compare performance across hospitals, some hospitals may treat sicker patients, so a direct comparison of mortality between hospitals may not be fair.

To address this problem, we can use trees to construct cohorts with similar characteristics, so that the baseline risk is similar within each cohort. This allows us to make fair comparisons within each cohort, a process we call risk-adjusted benchmarking. First, we train a tree to identify cohorts that share similar patterns for a given outcome. Within each cohort, we then use compare_group_outcomes to test whether there are significant differences in outcomes between the groups.
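To make the within-cohort comparison concrete, the following is a minimal sketch (not the compare_group_outcomes implementation) of the underlying idea: compute a Welch-style t-statistic between the two groups inside a single cohort. The group names and ratings here are toy values:

```julia
using Statistics

# Welch's t-statistic: difference in group means scaled by the combined
# standard error, allowing unequal variances between the two groups
function welch_t(a, b)
    se = sqrt(var(a) / length(a) + var(b) / length(b))
    (mean(a) - mean(b)) / se
end

# Toy IMDB ratings for the two groups within one cohort
dc     = [7.9, 8.1, 7.8, 8.3]
marvel = [7.1, 7.4, 6.9, 7.2]

t = welch_t(dc, marvel)  # a large |t| suggests a real difference in means
```

In practice, compare_group_outcomes handles the statistical testing and reports p-values directly; this sketch only illustrates comparing group means within a risk-matched cohort.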


The following example showcases benchmarking with a continuous outcome using a Kaggle dataset comparing IMDB ratings for Marvel and DC movies/shows. For an example that uses benchmarking with a discrete outcome, see the detecting bias in jury selection case study.

Benchmarking will help us answer the following questions:

  1. Is there a significant difference in the IMDB ratings between Marvel and DC movies/shows, controlling for other factors such as genre and length?
  2. If there is such a difference, in which segment(s) of movies/shows does Marvel or DC excel?

Data Preparation

We first clean the data so that each row corresponds to a single movie or series, and create features such as genre indicators, runtime, and release year:

using CSV, DataFrames
using Statistics
using CategoricalArrays
df = CSV.read("Marvel_DC_imdb.csv", DataFrame, pool=false)
df = df[completecases(df, :IMDB_Score), :]
df.Movie = lstrip.(df.Movie)

df.RunTime = map(df.RunTime) do t
  # Strip the " min" suffix and parse the remaining number
  ismissing(t) ? missing : parse(Int, replace(t, " min" => ""))
end
df.Year = map(df.Year) do year
  if ismissing(year)
    missing
  else
    # Remove parentheses and other non-numeric characters, keep the first year
    year = replace(year, r"\(|\)|[^0-9.]" => "")
    parse(Int, year[1:4])
  end
end

df = combine(groupby(df, [:Movie, :Genre, :Category,]),
             :IMDB_Score => mean,
             :Year => first,
             :RunTime => mean)
X = select(df, [:Movie; :RunTime_mean; :Year_first])
transform!(X, :Movie => categorical, renamecols=false)

# Count the occurrences of each genre across all movies/shows
dict = Dict()
for i in 1:size(df, 1)
  if ismissing(df[i, :Genre])
    continue
  end
  items = split(df[i, :Genre], ",")
  for item in items
    if haskey(dict, item)
      dict[item] += 1
    else
      dict[item] = 1
    end
  end
end
# Add an indicator column for each genre
top_items = collect(keys(dict))
for top_item in top_items
  X[!, Symbol(top_item)] .= false
end
for i in 1:size(df, 1)
  if ismissing(df[i, :Genre])
    continue
  end
  items = split(df[i, :Genre], ",")
  for top_item in top_items
    if top_item in items
      X[i, Symbol(top_item)] = true
    end
  end
end
y = df.IMDB_Score_mean
group = df.Category
270×25 DataFrame
 Row │ Movie                              RunTime_mean  Year_first  Family  Sh ⋯
     │ Cat…                               Float64?      Int64       Bool    Bo ⋯
   1 │ The Falcon and the Winter Soldier       50.5714        2021   false  fa ⋯
   2 │ WandaVision                             70.0           2021   false  fa
   3 │ Avengers: Endgame                      181.0           2019   false  fa
   4 │ Guardians of the Galaxy                121.0           2014   false  fa
   5 │ Spider-Man: Far from Home              129.0           2019   false  fa ⋯
   6 │ Thor: Ragnarok                         130.0           2017   false  fa
   7 │ Avengers: Infinity War                 149.0           2018   false  fa
   8 │ Black Panther                          134.0           2018   false  fa
  ⋮  │                 ⋮                       ⋮            ⋮         ⋮        ⋱
 264 │ The Flash                               42.1875        2014   false  fa ⋯
 265 │ Legends of the Superheroes         missing             1979   false  fa
 266 │ Supergirl                               42.186         2015   false  fa
 267 │ Lucifer                                 43.303         2016   false  fa
 268 │ Powerless                               21.5           2017   false  fa ⋯
 269 │ DC's Legends of Tomorrow                42.0           2016   false  fa
 270 │ Black Lightning                         42.0           2017   false  fa
                                                 21 columns and 255 rows omitted

Training an Optimal Tree

We will now fit an Optimal Regression Tree to predict the IMDB rating. The goal is not to maximize the prediction accuracy, but to learn different cohorts of movies/shows that share similar characteristics and ratings. These cohorts will form the basis of our benchmarking analysis.

grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(random_seed=1),  # illustrative hyperparameters
    max_depth=2:5,
)
IAI.fit!(grid, X, y)
lnr = IAI.get_learner(grid)
Optimal Trees Visualization

Perform Outcome Comparison

Given this tree, we can then use compare_group_outcomes to generate statistics comparing Marvel and DC movies/shows in each cohort. Since each cohort has similar baseline IMDB ratings and characteristics, this means that any difference in Marvel and DC is not explained by other features in the data, and therefore gives a meaningful comparison:

outputs = IAI.compare_group_outcomes(lnr, X, y, group)

extras = map(1:length(outputs)) do i
  summary = outputs[i].summary
  p_value = outputs[i].p_value["vs-rest"]["DC"]
  # Color the node by result: neutral if not significant, otherwise by
  # which group has the higher mean rating (colors are illustrative)
  node_color = if p_value > 0.1
    "white"
  elseif summary.y_mean[1] > summary.y_mean[2]
    "green"
  else
    "red"
  end

  # Assumes the summary table has a :group column with the group names
  node_summary = "IMDB for $(summary.group[1]): $(round(summary.y_mean[1], digits=3)); " *
                 "IMDB for $(summary.group[2]): $(round(summary.y_mean[2], digits=3)) " *
                 "(p=$(round(p_value, digits=3)))"

  node_details = IAI.make_html_table(summary)

  Dict(:node_summary_include_default => false,
       :node_details_include_default => false,
       :node_summary_extra => node_summary,
       :node_details_extra => node_details,
       :node_color => node_color)
end
IAI.TreePlot(lnr, extra_content=extras)
Optimal Trees Visualization

We can see that for non-drama shows that run between 62 and 88 minutes, DC has a significantly higher IMDB rating than Marvel. In the other cohorts, there is no statistically significant difference, either because the sample size is small or the difference in means is small.
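As a sanity check on this kind of conclusion, we can always filter the data to a cohort by hand and compare the group means directly. The sketch below uses toy data with hypothetical column names mirroring the features above:

```julia
using DataFrames, Statistics

# Toy data; `Drama` stands in for the genre indicator used by the tree split
toy = DataFrame(
    RunTime_mean = [70.0, 80.0, 65.0, 75.0, 120.0, 45.0],
    Drama        = [false, false, false, false, false, true],
    Category     = ["DC", "DC", "Marvel", "Marvel", "DC", "Marvel"],
    IMDB_Score   = [8.2, 8.0, 7.1, 7.3, 6.9, 7.5],
)

# Restrict to the non-drama, 62-88 minute cohort identified by the tree
cohort = toy[(62 .<= toy.RunTime_mean .<= 88) .& .!toy.Drama, :]

# Mean rating per group within the cohort
by_group = combine(groupby(cohort, :Category), :IMDB_Score => mean)
```

This manual check only compares means; the p-values reported by compare_group_outcomes are still needed to judge whether such a difference is significant.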