# Benchmarking

One common issue with outcome comparisons between groups is that the groups may have very different characteristics, making a direct statistical comparison invalid or inappropriate. For example, if the goal is to compare performance across hospitals, some hospitals may treat more severe cases, so a direct comparison of mortality rates between hospitals may not be fair.

To solve this problem, trees can be used to construct cohorts with similar characteristics so that the baseline risk is similar in each cohort, which then allows us to make fair comparisons within each cohort. We call this process risk-adjusted benchmarking. First, we construct trees to identify cohorts that share similar patterns for a given outcome. Within each cohort, we then perform statistical comparisons between the groups using `compare_group_outcomes` to assess whether there are significant discrepancies in outcomes between the groups.
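Conceptually, the within-cohort comparison is a standard two-sample test on the outcome values of the two groups. As a rough sketch of the idea (using the open-source HypothesisTests.jl package and made-up outcome values, neither of which is part of the workflow described here):

```
using HypothesisTests

# Hypothetical outcome values for two groups within a single cohort
outcomes_a = [7.1, 6.8, 7.4, 7.0, 6.9]
outcomes_b = [7.8, 8.0, 7.6, 7.9]

# Welch's two-sample t-test for a difference in mean outcome;
# a small p-value indicates a significant difference between the groups
pvalue(UnequalVarianceTTest(outcomes_a, outcomes_b))
```

Because the cohorts are constructed so that baseline characteristics are similar, a significant difference within a cohort is more plausibly attributable to the group itself rather than to confounding features.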

## Example

The following example showcases benchmarking with a continuous outcome using a Kaggle dataset comparing IMDB ratings for Marvel and DC movies/shows. For an example that uses benchmarking with a discrete outcome, see the detecting bias in jury selection case study.

Benchmarking will help us answer the following questions:

- Is there a significant difference in the IMDB ratings between Marvel and DC movies/shows, controlling for other factors such as genre and length?
- If there is such a difference, in which segment(s) of movies/shows does Marvel or DC excel?

### Data Preparation

We first clean the data so that each row corresponds to a movie or a series, and create features such as the genre indicator variables, movie lengths, and the year:

```
using CSV, DataFrames
using Statistics
using CategoricalArrays

df = CSV.read("Marvel_DC_imdb.csv", DataFrame, pool=false)
df = df[completecases(df, :IMDB_Score), :]
df.Movie = lstrip.(df.Movie)

# Parse the runtime strings (e.g. "120 min") into integers
df.RunTime = map(df.RunTime) do t
  if ismissing(t)
    missing
  else
    parse(Int, replace(t, " min" => ""))
  end
end

# Extract the first year from the raw year strings
df.Year = map(df.Year) do year
  if ismissing(year)
    missing
  else
    year = replace(year, r"\(|\)|[^0-9.]" => "")
    parse(Int, year[1:4])
  end
end

# Aggregate to one row per movie/series
df = combine(groupby(df, [:Movie, :Genre, :Category]),
             :IMDB_Score => mean,
             :Year => first,
             :RunTime => mean)

X = select(df, [:Movie; :RunTime_mean; :Year_first])
transform!(X, :Movie => categorical, renamecols=false)

# Count the occurrences of each genre
dict = Dict()
for i in 1:size(df, 1)
  if ismissing(df[i, :Genre])
    continue
  end
  items = split(df[i, :Genre], ",")
  for item in items
    if haskey(dict, item)
      dict[item] += 1
    else
      dict[item] = 1
    end
  end
end
top_items = collect(keys(dict))

# Create an indicator column for each genre
for top_item in top_items
  X[!, Symbol(top_item)] .= false
end
for i in 1:size(df, 1)
  if ismissing(df[i, :Genre])
    continue
  end
  items = split(df[i, :Genre], ",")
  for top_item in top_items
    if top_item in items
      X[i, Symbol(top_item)] = true
    end
  end
end

y = df.IMDB_Score_mean
group = df.Category
X
```

```
270×25 DataFrame
Row │ Movie RunTime_mean Year_first Family Sh ⋯
│ Cat… Float64? Int64 Bool Bo ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ The Falcon and the Winter Soldier 50.5714 2021 false fa ⋯
2 │ WandaVision 70.0 2021 false fa
3 │ Avengers: Endgame 181.0 2019 false fa
4 │ Guardians of the Galaxy 121.0 2014 false fa
5 │ Spider-Man: Far from Home 129.0 2019 false fa ⋯
6 │ Thor: Ragnarok 130.0 2017 false fa
7 │ Avengers: Infinity War 149.0 2018 false fa
8 │ Black Panther 134.0 2018 false fa
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱
264 │ The Flash 42.1875 2014 false fa ⋯
265 │ Legends of the Superheroes missing 1979 false fa
266 │ Supergirl 42.186 2015 false fa
267 │ Lucifer 43.303 2016 false fa
268 │ Powerless 21.5 2017 false fa ⋯
269 │ DC's Legends of Tomorrow 42.0 2016 false fa
270 │ Black Lightning 42.0 2017 false fa
21 columns and 255 rows omitted
```
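Before training, it can also be helpful to look at the raw, unadjusted comparison between the two groups (a sketch that reuses the `y` and `group` vectors constructed above):

```
using DataFrames, Statistics

# Sample size and mean IMDB rating per group, ignoring all other features
overview = combine(groupby(DataFrame(group=group, rating=y), :group),
                   nrow => :count,
                   :rating => mean => :mean_rating)
```

This raw comparison does not control for genre, runtime, or year, which is exactly what the tree-based cohorts will account for.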

### Training an Optimal Tree

We will now fit an Optimal Regression Tree to predict the IMDB rating. The goal is not to maximize the prediction accuracy, but to learn different cohorts of movies/shows that share similar characteristics and ratings. These cohorts will form the basis of our benchmarking analysis.

```
grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(
        split_features=Not(:Movie),
        missingdatamode=:separate_class,
        random_seed=123,
        minbucket=3,
    ),
    max_depth=2:5,
)
IAI.fit!(grid, X, y)
lnr = IAI.get_learner(grid)
```
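Before comparing outcomes, we can inspect how the movies/shows are distributed across the resulting cohorts (a sketch assuming the `IAI.apply` function, which returns the index of the leaf that each row falls into):

```
# Leaf (cohort) index for each movie/show in the fitted tree
leaf = IAI.apply(lnr, X)

# Size of each cohort, broken down by group
using DataFrames
cohort_sizes = combine(groupby(DataFrame(leaf=leaf, group=group),
                               [:leaf, :group]),
                       nrow => :count)
```

Cohorts where one group has very few members will tend not to show significant differences, simply for lack of statistical power.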

### Perform Outcome Comparison

Given this tree, we can then use `compare_group_outcomes` to generate statistics comparing Marvel and DC movies/shows in each cohort. Since each cohort has similar baseline IMDB ratings and characteristics, any difference between Marvel and DC is not explained by the other features in the data, and the comparison is therefore meaningful:

```
outputs = IAI.compare_group_outcomes(lnr, X, y, group)
extras = map(1:length(outputs)) do i
  summary = outputs[i].summary
  p_value = outputs[i].p_value["vs-rest"]["DC"]
  # Leave the node white when the difference is not significant, otherwise
  # color it according to which group has the higher mean rating
  node_color = if p_value > 0.1
    "#FFFFFF"
  elseif summary.y_mean[1] > summary.y_mean[2]
    "#FFADB4"
  else
    "#92D86F"
  end
  node_summary = "IMDB for $(summary.group[1]): $(round(summary.y_mean[1], digits=3)); " *
                 "IMDB for $(summary.group[2]): $(round(summary.y_mean[2], digits=3)) " *
                 "(p=$(round(p_value, digits=3)))"
  node_details = IAI.make_html_table(summary)
  Dict(:node_summary_include_default => false,
       :node_details_include_default => false,
       :node_summary_extra => node_summary,
       :node_details_extra => node_details,
       :node_color => node_color)
end
IAI.TreePlot(lnr, extra_content=extras)
```

We can see that for non-drama shows that run between 62 and 88 minutes, DC has a significantly higher IMDB rating than Marvel. In the other cohorts, there is no statistically significant difference, either because the sample size is small or because the difference itself is small.