Quick Start Guide: Multi-task Optimal Regression Trees
This is an R version of the corresponding OptimalTrees quick start guide.
In this example we will use Optimal Regression Trees (ORT) on the concrete slump test dataset to solve a multi-task regression problem.
This guide assumes you are familiar with ORTs and focuses on aspects that are unique to the multi-task setting. For a general introduction to ORTs, please refer to the ORT quickstart guide.
First we load the data:
df <- read.csv("slump_test.data")
No Cement Slag Fly.ash Water SP Coarse.Aggr. Fine.Aggr. SLUMP.cm. FLOW.cm.
1 1 273 82 105 210 9 904 680 23 62.0
2 2 163 149 191 180 12 843 746 0 20.0
3 3 162 148 191 179 16 840 743 1 20.0
4 4 162 148 190 179 19 838 741 3 21.5
5 5 154 112 144 220 10 923 658 20 64.0
Compressive.Strength..28.day..Mpa.
1 34.99
2 41.14
3 41.81
4 42.08
5 26.82
[ reached 'max' / getOption("max.print") -- omitted 98 rows ]
The goal is to predict three characteristics of concrete (slump, flow, and 28-day compressive strength) from its other properties. We therefore rename the target columns, separate them from the rest of the features, and split the data into training and test sets:
colnames(df)[colnames(df) == "SLUMP.cm."] <- "Slump"
colnames(df)[colnames(df) == "FLOW.cm."] <- "Flow"
colnames(df)[colnames(df) == "Compressive.Strength..28.day..Mpa."] <- "Strength"
targets <- c("Slump", "Flow", "Strength")
X <- df[, !names(df) %in% c("No", targets)]
y <- df[targets]
split <- iai::split_data("multi_regression", X, y, seed = 1)
train_X <- split$train$X
train_y <- split$train$y
test_X <- split$test$X
test_y <- split$test$y
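The column-selection idiom used above can be checked on a small data frame without the full dataset. The sketch below is base R only; `toy` is a hypothetical data frame holding a few values copied from the rows printed earlier, with the target columns already renamed. It shows that the feature table keeps every column except the identifier `No` and the targets, while the target table keeps one column per task.

```r
# Hypothetical toy data frame: a few values copied from the printed rows,
# with the target columns already renamed
toy <- data.frame(
  No = 1:4,
  Cement = c(273, 163, 162, 154),
  Water = c(210, 180, 179, 220),
  Slump = c(23, 0, 1, 20),
  Flow = c(62, 20, 20, 64),
  Strength = c(34.99, 41.14, 41.81, 26.82)
)
targets <- c("Slump", "Flow", "Strength")

# Features: drop the row identifier and the target columns
toy_X <- toy[, !names(toy) %in% c("No", targets)]

# Targets: one column per task, in the order given by `targets`
toy_y <- toy[targets]

names(toy_X)
names(toy_y)
```

Because `%in%` binds more tightly than `!`, the expression negates the whole membership test, so no extra parentheses are needed.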
Multi-task Optimal Regression Trees
We will use a grid_search to fit an optimal_tree_multi_regressor:
grid <- iai::grid_search(
    iai::optimal_tree_multi_regressor(
        random_seed = 1
    ),
    max_depth = 1:5
)
iai::fit(grid, train_X, train_y)
iai::get_learner(grid)
We can make predictions on new data using predict:
iai::predict(grid, test_X)
$Slump
[1] 8.70000 20.55172 20.55172 20.55172 20.55172 20.55172 20.55172 20.55172
[9] 20.55172 8.70000 8.70000 8.70000 20.55172 20.55172 20.55172 20.55172
[17] 20.55172 20.55172 8.70000 20.55172 20.55172 20.55172 20.55172 0.50000
[25] 20.55172 20.55172 0.50000 20.55172 0.50000 20.55172 20.55172
$Flow
[1] 27.45 54.40 54.40 54.40 54.40 54.40 54.40 54.40 54.40 27.45 27.45 27.45
[13] 54.40 54.40 54.40 54.40 54.40 54.40 27.45 54.40 54.40 54.40 54.40 20.00
[25] 54.40 54.40 20.00 54.40 20.00 54.40 54.40
$Strength
[1] 30.08700 35.01776 35.01776 35.01776 35.01776 35.01776 35.01776 35.01776
[9] 35.01776 30.08700 30.08700 30.08700 35.01776 35.01776 35.01776 35.01776
[17] 35.01776 35.01776 30.08700 35.01776 35.01776 35.01776 35.01776 45.68750
[25] 35.01776 35.01776 45.68750 35.01776 45.68750 35.01776 35.01776
This returns a list containing the predictions for each task, which can easily be converted to a data frame:
as.data.frame(iai::predict(grid, test_X))
Slump Flow Strength
1 8.70000 27.45 30.08700
2 20.55172 54.40 35.01776
3 20.55172 54.40 35.01776
4 20.55172 54.40 35.01776
5 20.55172 54.40 35.01776
6 20.55172 54.40 35.01776
7 20.55172 54.40 35.01776
8 20.55172 54.40 35.01776
9 20.55172 54.40 35.01776
10 8.70000 27.45 30.08700
11 8.70000 27.45 30.08700
12 8.70000 27.45 30.08700
13 20.55172 54.40 35.01776
14 20.55172 54.40 35.01776
15 20.55172 54.40 35.01776
16 20.55172 54.40 35.01776
17 20.55172 54.40 35.01776
18 20.55172 54.40 35.01776
19 8.70000 27.45 30.08700
20 20.55172 54.40 35.01776
[ reached 'max' / getOption("max.print") -- omitted 11 rows ]
We can also generate the predictions for a specific task by passing the task label:
iai::predict(grid, test_X, "Slump")
[1] 8.70000 20.55172 20.55172 20.55172 20.55172 20.55172 20.55172 20.55172
[9] 20.55172 8.70000 8.70000 8.70000 20.55172 20.55172 20.55172 20.55172
[17] 20.55172 20.55172 8.70000 20.55172 20.55172 20.55172 20.55172 0.50000
[25] 20.55172 20.55172 0.50000 20.55172 0.50000 20.55172 20.55172
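The shape of these return values can be illustrated without a fitted model. In the sketch below, `preds` is a hypothetical stand-in for the list returned by `iai::predict(grid, test_X)`, filled with the first three predictions from the output above; everything else is base R.

```r
# Hypothetical stand-in for the output of iai::predict(grid, test_X):
# a named list with one numeric vector per task
preds <- list(
  Slump = c(8.70000, 20.55172, 20.55172),
  Flow = c(27.45, 54.40, 54.40),
  Strength = c(30.08700, 35.01776, 35.01776)
)

# Converting the list gives one column per task and one row per test point
preds_df <- as.data.frame(preds)
dim(preds_df)

# Extracting one element of the list yields the vector for that task
preds[["Slump"]]
```

In the output shown above, the vector under `$Slump` matches what the single-task call returns, so either form can be used when only one task is of interest.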
We can evaluate the quality of the tree using score with any of the supported loss functions. For multi-task problems, the returned score is the average of the scores of the individual tasks:
iai::score(grid, test_X, test_y)
[1] -0.1516529
We can also calculate the score of a single task by specifying this task:
iai::score(grid, test_X, test_y, "Flow")
[1] -0.163777
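To make the averaging concrete, the sketch below scores each task separately on made-up numbers and then takes the mean across tasks. It assumes an R²-style score measured against the mean of the supplied targets; that choice, the helper `r_squared`, and all values are illustrative assumptions rather than the package's internal computation.

```r
# Hypothetical helper: R^2 of predictions relative to the mean of `actual`
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up targets and predictions for two tasks
actual <- data.frame(Slump = c(10, 20, 30), Flow = c(30, 50, 70))
predicted <- data.frame(Slump = c(12, 20, 28), Flow = c(30, 55, 65))

# Score each task on its own
task_scores <- sapply(names(actual), function(task) {
  r_squared(actual[[task]], predicted[[task]])
})

# Average the per-task scores to get a single multi-task score
overall <- mean(task_scores)

task_scores
overall
```

A score of this kind becomes negative whenever the predictions fit worse than a constant baseline, which is one way values such as those above can arise.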
Extensions
The standard ORT extensions (e.g. hyperplane splits, linear regression) are also available in the multi-task setting and controlled in the usual way.
For instance, we can use Optimal Regression Trees with hyperplane splits:
grid <- iai::grid_search(
    iai::optimal_tree_multi_regressor(
        random_seed = 1,
        max_depth = 2,
        hyperplane_config = list(sparsity = "all")
    )
)
iai::fit(grid, train_X, train_y)
iai::get_learner(grid)
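For readers unfamiliar with the term, a hyperplane split tests a weighted combination of several features against a threshold, whereas a standard split tests a single feature. The sketch below illustrates the difference in base R; the weights, thresholds, and feature values are made up and are not taken from any fitted tree.

```r
# Two made-up observations with two features each
x <- data.frame(Water = c(180, 220), Cement = c(163, 273))

# Standard (axis-parallel) split: compare one feature with a threshold
parallel_left <- x$Water < 200

# Hyperplane split: compare a weighted sum of features with a threshold
weights <- c(Water = 1.0, Cement = -0.5)  # hypothetical coefficients
threshold <- 90                           # hypothetical threshold
hyperplane_left <- as.vector(as.matrix(x) %*% weights) < threshold

parallel_left
hyperplane_left
```

The two rules send these observations in opposite directions, which shows how a hyperplane split can separate points that a single-feature split groups differently. As we understand it, `sparsity = "all"` places no limit on how many features such a split may combine.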