Quick Start Guide: Multi-task Optimal Regression Trees
This is a Python version of the corresponding OptimalTrees quick start guide.
In this example we will use Optimal Regression Trees (ORTs) on the concrete slump test dataset to solve a multi-task regression problem.
This guide assumes you are familiar with ORTs and focuses on aspects that are unique to the multi-task setting. For a general introduction to ORTs, please refer to the ORT quickstart guide.
First we load in the data and split into training and test datasets:
import pandas as pd
df = pd.read_csv('slump_test.data')
No Cement ... FLOW(cm) Compressive Strength (28-day)(Mpa)
0 1 273.0 ... 62.0 34.99
1 2 163.0 ... 20.0 41.14
2 3 162.0 ... 20.0 41.81
3 4 162.0 ... 21.5 42.08
4 5 154.0 ... 64.0 26.82
5 6 147.0 ... 55.0 25.21
6 7 152.0 ... 20.0 38.86
.. ... ... ... ... ...
96 97 215.6 ... 64.0 39.13
97 98 295.3 ... 77.0 44.08
98 99 248.3 ... 20.0 49.97
99 100 248.0 ... 20.0 50.23
100 101 258.8 ... 20.0 50.50
101 102 297.1 ... 67.0 49.17
102 103 348.7 ... 78.0 48.77
[103 rows x 11 columns]
The goal is to predict three characteristics of concrete from the other properties. We therefore separate these targets from the rest of the features and split the data for training and testing:
from interpretableai import iai
df = df.rename(columns={
"SLUMP(cm)": "Slump",
"FLOW(cm)": "Flow",
"Compressive Strength (28-day)(Mpa)": "Strength",
})
targets = ['Slump', 'Flow', 'Strength']
X = df.drop(['No', *targets], axis=1)
y = df[targets]
(train_X, train_y), (test_X, test_y) = iai.split_data('multi_regression', X, y,
                                                      seed=1)
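For intuition, `iai.split_data` partitions the rows into training and test sets. A minimal plain-numpy stand-in (with a hypothetical 75/25 ratio; the actual default and sampling scheme of `split_data` may differ) looks like:

```python
import numpy as np
import pandas as pd

def simple_split(X, y, train_frac=0.75, seed=1):
    """Randomly partition rows of X and y into train/test sets.

    A plain-numpy stand-in for iai.split_data; the real function
    may use a different default ratio and sampling scheme.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    train, test = idx[:cut], idx[cut:]
    return (X.iloc[train], y.iloc[train]), (X.iloc[test], y.iloc[test])

# Toy data with the same shape of usage as the guide.
X = pd.DataFrame({'Cement': [273.0, 163.0, 162.0, 154.0]})
y = pd.DataFrame({'Slump': [23.0, 0.0, 1.0, 17.5]})
(tr_X, tr_y), (te_X, te_y) = simple_split(X, y)
```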
Multi-task Optimal Regression Trees
We will use a GridSearch to fit an OptimalTreeMultiRegressor:
grid = iai.GridSearch(
iai.OptimalTreeMultiRegressor(
random_seed=1,
),
max_depth=range(1, 6),
)
grid.fit(train_X, train_y)
grid.get_learner()
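Conceptually, the grid search fits one learner per candidate `max_depth` and keeps the one with the best validation performance. A rough sketch of that selection loop (the `fit` and `score` callables here are hypothetical stand-ins, not the IAI API):

```python
def grid_search(param_values, fit, score):
    """Fit one model per parameter value and keep the best scorer.

    A conceptual stand-in for iai.GridSearch: fit/score are
    hypothetical callables, not the IAI API.
    """
    return max(((p, fit(p)) for p in param_values),
               key=lambda pair: score(pair[1]))

# Toy example: the "model" is just its depth, and validation
# scores peak at depth 3 before overfitting sets in.
scores = {1: 0.2, 2: 0.5, 3: 0.7, 4: 0.6, 5: 0.4}
best_depth, best_model = grid_search(range(1, 6),
                                     fit=lambda d: d,
                                     score=lambda m: scores[m])
```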
We can make predictions on new data using predict:
grid.predict(test_X)
{'Slump': array([ 8.7, 20.55172414, 20.55172414, ..., 0.5, 20.55172414, 20.55172414]),
 'Flow': array([27.45, 54.4, 54.4, ..., 20., 54.4, 54.4]),
 'Strength': array([30.087, 35.01775862, 35.01775862, ..., 45.6875, 35.01775862, 35.01775862])}
This returns a dictionary containing the predictions for each task, which can easily be converted to a dataframe:
pd.DataFrame(grid.predict(test_X))
Slump Flow Strength
0 8.700000 27.45 30.087000
1 20.551724 54.40 35.017759
2 20.551724 54.40 35.017759
3 20.551724 54.40 35.017759
4 20.551724 54.40 35.017759
5 20.551724 54.40 35.017759
6 20.551724 54.40 35.017759
.. ... ... ...
24 20.551724 54.40 35.017759
25 20.551724 54.40 35.017759
26 0.500000 20.00 45.687500
27 20.551724 54.40 35.017759
28 0.500000 20.00 45.687500
29 20.551724 54.40 35.017759
30 20.551724 54.40 35.017759
[31 rows x 3 columns]
We can also generate the predictions for a specific task by passing the task label:
grid.predict(test_X, 'Slump')
array([ 8.7 , 20.55172414, 20.55172414, ..., 0.5 ,
20.55172414, 20.55172414])
We can evaluate the quality of the tree using score with any of the supported loss functions. For multi-task problems, the returned score is the average of the scores of the individual tasks:
grid.score(test_X, test_y)
-0.1516528561254562
We can also calculate the score of a single task by passing the task label:
grid.score(test_X, test_y, 'Flow')
-0.163777031780044
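To make the averaging concrete, here is a sketch of how a multi-task R² could be computed by hand from per-task predictions (plain numpy with made-up values; the exact loss computation IAI performs internally may differ):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions for two tasks.
y_true = {'Flow': np.array([20.0, 54.0, 64.0]),
          'Slump': np.array([0.5, 20.0, 23.0])}
y_pred = {'Flow': np.array([22.0, 50.0, 60.0]),
          'Slump': np.array([1.0, 19.0, 24.0])}

# Score each task separately, then average across tasks.
task_scores = [r_squared(y_true[t], y_pred[t]) for t in y_true]
overall = np.mean(task_scores)
```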
Extensions
The standard ORT extensions (e.g. hyperplane splits, linear regression) are also available in the multi-task setting and controlled in the usual way.
For instance, we can use Optimal Regression Trees with hyperplane splits:
grid = iai.GridSearch(
iai.OptimalTreeMultiRegressor(
random_seed=1,
max_depth=2,
hyperplane_config={'sparsity': 'all'}
),
)
grid.fit(train_X, train_y)
grid.get_learner()
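Whereas a standard axis-aligned split tests a single feature against a threshold, a hyperplane split tests a weighted combination of features against a threshold. A minimal numpy illustration of how the two kinds of split partition points (illustrative weights only, not those the learner would find):

```python
import numpy as np

# Each row is an observation with two features.
X = np.array([[1.0, 3.0],
              [3.0, 1.0],
              [0.5, 0.5],
              [2.0, 3.0]])

# Axis-aligned split: send a point left if feature 0 < 1.5.
axis_left = X[:, 0] < 1.5

# Hyperplane split: send a point left if 0.6*x0 + 0.4*x1 < 1.5
# (illustrative weights and threshold).
w, b = np.array([0.6, 0.4]), 1.5
hyper_left = X @ w < b
```

Note that the first point goes left under the axis-aligned split but right under the hyperplane split, since the second feature also contributes to the hyperplane test.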