One common issue with comparing outcomes between groups is that the groups may have very different characteristics, making a direct statistical comparison invalid or inappropriate. For example, if the goal is to compare performance across hospitals, some hospitals may treat more severe patients, and as a result, a direct comparison of mortality between hospitals may not be fair.
To address this problem, we can use trees to construct cohorts with similar characteristics, so that the baseline risk is similar within each cohort, which then allows us to make fair comparisons inside each cohort. We call this process risk-adjusted benchmarking. First, we construct trees to identify cohorts that share similar patterns for a given outcome. Within each cohort, we then use
compare_group_outcomes to perform statistical comparisons between the groups and assess whether there are significant discrepancies in outcomes.
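Conceptually, the within-cohort comparison for a continuous outcome amounts to a two-sample test of means. As a rough sketch (not necessarily the exact test used by compare_group_outcomes), here is Welch's t statistic computed by hand on hypothetical ratings for the two groups within one cohort:

```julia
using Statistics

# Welch's two-sample t statistic: compare mean outcomes between two
# groups within a cohort, without assuming equal variances
function welch_t(x, y)
    nx, ny = length(x), length(y)
    vx, vy = var(x), var(y)
    se = sqrt(vx / nx + vy / ny)
    t = (mean(x) - mean(y)) / se
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (vx / nx + vy / ny)^2 /
         ((vx / nx)^2 / (nx - 1) + (vy / ny)^2 / (ny - 1))
    return t, df
end

# Hypothetical IMDB ratings for the two groups within one cohort
marvel = [7.1, 6.8, 7.4, 7.0]
dc     = [7.9, 8.2, 7.7, 8.0]
t, df = welch_t(marvel, dc)  # t is negative here since mean(marvel) < mean(dc)
```

The t statistic and degrees of freedom can then be converted to a p-value against the appropriate t-distribution; in practice this is handled for us by compare_group_outcomes.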
The following example showcases benchmarking with a continuous outcome using a Kaggle dataset comparing IMDB ratings for Marvel and DC movies/shows. For an example that uses benchmarking with a discrete outcome, see the detecting bias in jury selection case study.
Benchmarking will help us answer the following questions:
- Is there a significant difference in the IMDB ratings between Marvel and DC movies/shows, controlling for other factors such as genre and length?
- If there is such a difference, in which segment(s) of movies/shows does Marvel or DC excel?
We first clean the data so that each row corresponds to a movie or a series, and create features such as the genre indicator variables, movie lengths, and the year:
```julia
using CSV, DataFrames
using Statistics
using CategoricalArrays

df = CSV.read("Marvel_DC_imdb.csv", DataFrame, pool=false)
df = df[completecases(df, :IMDB_Score), :]
df.Movie = lstrip.(df.Movie)
df.RunTime = map(df.RunTime) do t
  if ismissing(t)
    missing
  else
    parse(Int, replace(t, " min" => ""))
  end
end
df.Year = map(df.Year) do year
  if ismissing(year)
    missing
  else
    year = replace(year, r"\(|\)|[^0-9.]" => "")
    parse(Int, year[1:4])
  end
end
df = combine(groupby(df, [:Movie, :Genre, :Category]),
             :IMDB_Score => mean, :Year => first, :RunTime => mean)

X = select(df, [:Movie; :RunTime_mean; :Year_first])
transform!(X, :Movie => categorical, renamecols=false)

# Count the occurrences of each genre across all movies/shows
dict = Dict()
for i in 1:size(df, 1)
  if ismissing(df[i, :Genre])
    continue
  end
  items = split(df[i, :Genre], ",")
  for item in items
    if haskey(dict, item)
      dict[item] += 1
    else
      dict[item] = 1
    end
  end
end
top_items = collect(keys(dict))

# Add a Bool indicator column for each genre
for top_item in top_items
  X[!, Symbol(top_item)] .= false
end
for i in 1:size(df, 1)
  if ismissing(df[i, :Genre])
    continue
  end
  items = split(df[i, :Genre], ",")
  for top_item in top_items
    if top_item in items
      X[i, Symbol(top_item)] = true
    end
  end
end

y = df.IMDB_Score_mean
group = df.Category
X
```
```
270×25 DataFrame
 Row │ Movie                              RunTime_mean  Year_first  Family  Sh ⋯
     │ Cat…                               Float64?      Int64       Bool    Bo ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ The Falcon and the Winter Soldier      50.5714         2021   false  fa ⋯
   2 │ WandaVision                            70.0            2021   false  fa
   3 │ Avengers: Endgame                     181.0            2019   false  fa
   4 │ Guardians of the Galaxy               121.0            2014   false  fa
   5 │ Spider-Man: Far from Home             129.0            2019   false  fa ⋯
   6 │ Thor: Ragnarok                        130.0            2017   false  fa
   7 │ Avengers: Infinity War                149.0            2018   false  fa
   8 │ Black Panther                         134.0            2018   false  fa
  ⋮  │                 ⋮                        ⋮           ⋮         ⋮      ⋱
 264 │ The Flash                              42.1875         2014   false  fa ⋯
 265 │ Legends of the Superheroes         missing             1979   false  fa
 266 │ Supergirl                              42.186          2015   false  fa
 267 │ Lucifer                                43.303          2016   false  fa
 268 │ Powerless                              21.5            2017   false  fa ⋯
 269 │ DC's Legends of Tomorrow               42.0            2016   false  fa
 270 │ Black Lightning                        42.0            2017   false  fa
                                                  21 columns and 255 rows omitted
```
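The genre-indicator step above can be seen in isolation on a hypothetical mini-dataset (the movie names and genres here are made up for illustration): split each comma-separated Genre string and create one Bool column per genre.

```julia
using DataFrames

# Hypothetical two-row example of the one-hot genre encoding used above
mini = DataFrame(Movie=["A", "B"], Genre=["Action,Comedy", "Action,Drama"])

# Collect the set of distinct genres across all rows
genres = unique(reduce(vcat, split.(mini.Genre, ",")))

# For each genre, add a Bool column marking which rows contain it
for g in genres
    mini[!, Symbol(g)] = [g in split(s, ",") for s in mini.Genre]
end
# mini now has Bool columns :Action, :Comedy, :Drama
```

This broadcasting-based version behaves the same as the loop in the cleaning code, aside from the handling of missing Genre values.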
We will now fit an Optimal Regression Tree to predict the IMDB rating. The goal is not to maximize the prediction accuracy, but to learn different cohorts of movies/shows that share similar characteristics and ratings. These cohorts will form the basis of our benchmarking analysis.
```julia
grid = IAI.GridSearch(
    IAI.OptimalTreeRegressor(
        split_features=Not(:Movie),
        missingdatamode=:separate_class,
        random_seed=123,
        minbucket=3,
    ),
    max_depth=2:5,
)
IAI.fit!(grid, X, y)
lnr = IAI.get_learner(grid)
```
Given this tree, we can then use
compare_group_outcomes to generate statistics comparing Marvel and DC movies/shows in each cohort. Since each cohort has similar baseline IMDB ratings and characteristics, any difference between Marvel and DC is not explained by the other features in the data, and therefore gives a meaningful comparison:
```julia
outputs = IAI.compare_group_outcomes(lnr, X, y, group)
extras = map(1:length(outputs)) do i
  summary = outputs[i].summary
  p_value = outputs[i].p_value["vs-rest"]["DC"]
  node_color = if p_value > 0.1
    "#FFFFFF"
  elseif summary.y_mean[1] > summary.y_mean[2]
    "#FFADB4"
  else
    "#92D86F"
  end
  node_summary = "IMDB for $(summary.group[1]): $(round(summary.y_mean[1], digits=3)); " *
                 "IMDB for $(summary.group[2]): $(round(summary.y_mean[2], digits=3)) " *
                 "(p=$(round(p_value, digits=3)))"
  node_details = IAI.make_html_table(summary)
  Dict(
      :node_summary_include_default => false,
      :node_details_include_default => false,
      :node_summary_extra => node_summary,
      :node_details_extra => node_details,
      :node_color => node_color,
  )
end
IAI.TreePlot(lnr, extra_content=extras)
```
We can see that for non-drama shows that run between 62 and 88 minutes, DC has a significantly higher IMDB rating than Marvel. In the other cohorts, there is no statistically significant difference, either because the sample size is small or because the difference in ratings is small.