Quick Start Guide: Multi-task Optimal Regression Trees
This is a Python version of the corresponding OptimalTrees quick start guide.
In this example we will use Optimal Regression Trees (ORTs) on the concrete slump test dataset to solve a multi-task regression problem.
This guide assumes you are familiar with ORTs and focuses on aspects that are unique to the multi-task setting. For a general introduction to ORTs, please refer to the ORT quickstart guide.
First we load in the data and split into training and test datasets:
import pandas as pd
df = pd.read_csv('slump_test.data')
No Cement ... FLOW(cm) Compressive Strength (28-day)(Mpa)
0 1 273.0 ... 62.0 34.99
1 2 163.0 ... 20.0 41.14
2 3 162.0 ... 20.0 41.81
3 4 162.0 ... 21.5 42.08
4 5 154.0 ... 64.0 26.82
5 6 147.0 ... 55.0 25.21
6 7 152.0 ... 20.0 38.86
.. ... ... ... ... ...
96 97 215.6 ... 64.0 39.13
97 98 295.3 ... 77.0 44.08
98 99 248.3 ... 20.0 49.97
99 100 248.0 ... 20.0 50.23
100 101 258.8 ... 20.0 50.50
101 102 297.1 ... 67.0 49.17
102 103 348.7 ... 78.0 48.77
[103 rows x 11 columns]
The goal is to predict three characteristics of concrete from the other properties. We therefore separate these targets from the rest of the features and split the data for training and testing:
from interpretableai import iai
df = df.rename(columns={
"SLUMP(cm)": "Slump",
"FLOW(cm)": "Flow",
"Compressive Strength (28-day)(Mpa)": "Strength",
})
targets = ['Slump', 'Flow', 'Strength']
X = df.drop(['No', *targets], axis=1)
y = df[targets]
(train_X, train_y), (test_X, test_y) = iai.split_data('multi_regression', X, y,
                                                      seed=1)
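For intuition, `iai.split_data` partitions the rows into training and test sets. A minimal plain-numpy stand-in (with a hypothetical 75/25 ratio; the actual default and sampling scheme of `split_data` may differ) looks like:

```python
import numpy as np
import pandas as pd

def simple_split(X, y, train_frac=0.75, seed=1):
    """Randomly partition rows of X and y into train/test sets.

    A plain-numpy stand-in for iai.split_data; the real function
    may use a different default ratio and sampling scheme.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    train, test = idx[:cut], idx[cut:]
    return (X.iloc[train], y.iloc[train]), (X.iloc[test], y.iloc[test])

# Toy data with the same shape of usage as the guide.
X = pd.DataFrame({'Cement': [273.0, 163.0, 162.0, 154.0]})
y = pd.DataFrame({'Slump': [23.0, 0.0, 1.0, 17.5]})
(tr_X, tr_y), (te_X, te_y) = simple_split(X, y)
```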
Multi-task Optimal Regression Trees
We will use a GridSearch to fit an OptimalTreeMultiRegressor:
grid = iai.GridSearch(
iai.OptimalTreeMultiRegressor(
random_seed=1,
),
max_depth=range(1, 6),
)
grid.fit(train_X, train_y)
grid.get_learner()
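Conceptually, the grid search fits one learner per candidate `max_depth` and keeps the one with the best validation performance. A rough sketch of that selection loop (the `fit` and `score` callables here are hypothetical stand-ins, not the IAI API):

```python
def grid_search(param_values, fit, score):
    """Fit one model per parameter value and keep the best scorer.

    A conceptual stand-in for iai.GridSearch: fit/score are
    hypothetical callables, not the IAI API.
    """
    return max(((p, fit(p)) for p in param_values),
               key=lambda pair: score(pair[1]))

# Toy example: the "model" is just its depth, and validation
# scores peak at depth 3 before overfitting sets in.
scores = {1: 0.2, 2: 0.5, 3: 0.7, 4: 0.6, 5: 0.4}
best_depth, best_model = grid_search(range(1, 6),
                                     fit=lambda d: d,
                                     score=lambda m: scores[m])
```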
We can make predictions on new data using predict:
grid.predict(test_X)
{'Slump': array([ 8.7, 20.55172414, 20.55172414, ..., 0.5, 20.55172414, 20.55172414]),
 'Flow': array([27.45, 54.4, 54.4, ..., 20., 54.4, 54.4]),
 'Strength': array([30.087, 35.01775862, 35.01775862, ..., 45.6875, 35.01775862, 35.01775862])}
This returns a dictionary containing the predictions for each task, which can easily be converted to a dataframe:
pd.DataFrame(grid.predict(test_X))
Slump Flow Strength
0 8.700000 27.45 30.087000
1 20.551724 54.40 35.017759
2 20.551724 54.40 35.017759
3 20.551724 54.40 35.017759
4 20.551724 54.40 35.017759
5 20.551724 54.40 35.017759
6 20.551724 54.40 35.017759
.. ... ... ...
24 20.551724 54.40 35.017759
25 20.551724 54.40 35.017759
26 0.500000 20.00 45.687500
27 20.551724 54.40 35.017759
28 0.500000 20.00 45.687500
29 20.551724 54.40 35.017759
30 20.551724 54.40 35.017759
[31 rows x 3 columns]
We can also generate the predictions for a specific task by passing the task label:
grid.predict(test_X, 'Slump')
array([ 8.7 , 20.55172414, 20.55172414, ..., 0.5 ,
20.55172414, 20.55172414])
We can evaluate the quality of the tree using score with any of the supported loss functions. For multi-task problems, the returned score is the average of the scores of the individual tasks:
grid.score(test_X, test_y)
-0.1516528561254562
We can also calculate the score of a single task by passing the task label:
grid.score(test_X, test_y, 'Flow')
-0.163777031780044
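To make the averaging concrete, here is a sketch of how a multi-task R² could be computed by hand from per-task predictions (plain numpy with made-up values; the exact loss computation IAI performs internally may differ):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions for two tasks.
y_true = {'Flow': np.array([20.0, 54.0, 64.0]),
          'Slump': np.array([0.5, 20.0, 23.0])}
y_pred = {'Flow': np.array([22.0, 50.0, 60.0]),
          'Slump': np.array([1.0, 19.0, 24.0])}

# Score each task separately, then average across tasks.
task_scores = [r_squared(y_true[t], y_pred[t]) for t in y_true]
overall = np.mean(task_scores)
```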
Extensions
The standard ORT extensions (e.g. hyperplane splits, linear regression) are also available in the multi-task setting and controlled in the usual way.
For instance, we can use Optimal Regression Trees with hyperplane splits:
grid = iai.GridSearch(
iai.OptimalTreeMultiRegressor(
random_seed=1,
max_depth=2,
hyperplane_config={'sparsity': 'all'}
),
)
grid.fit(train_X, train_y)
grid.get_learner()
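Whereas a standard axis-aligned split tests a single feature against a threshold, a hyperplane split tests a weighted combination of features against a threshold. A minimal numpy illustration of how the two kinds of split partition points (illustrative weights only, not those the learner would find):

```python
import numpy as np

# Each row is an observation with two features.
X = np.array([[1.0, 3.0],
              [3.0, 1.0],
              [0.5, 0.5],
              [2.0, 3.0]])

# Axis-aligned split: send a point left if feature 0 < 1.5.
axis_left = X[:, 0] < 1.5

# Hyperplane split: send a point left if 0.6*x0 + 0.4*x1 < 1.5
# (illustrative weights and threshold).
w, b = np.array([0.6, 0.4]), 1.5
hyper_left = X @ w < b
```

Note that the first point goes left under the axis-aligned split but right under the hyperplane split, since the second feature also contributes to the hyperplane test.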