Quick Start Guide: Multi-task Optimal Classification Trees

This is a Python version of the corresponding OptimalTrees quick start guide.

In this example we will use Optimal Classification Trees (OCT) on the Acute Inflammations dataset to solve a multi-task classification problem.

This guide assumes you are familiar with OCTs and focuses on the aspects that are unique to the multi-task setting. For a general introduction to OCTs, please refer to the OCT quick start guide.

First we load in the data and split into training and test datasets:

import pandas as pd
df = pd.read_csv('diagnosis.data',
    encoding='UTF-16',
    sep="\t",
    decimal=',',
    names=['temp', 'nausea', 'lumbar_pain', 'urine_pushing',
           'micturition_pains', 'burning', 'inflammation', 'nephritis'],
)

for col in df.columns:
    if df.dtypes[col] == 'object':
        df[col] = df[col].astype('category')
df
     temp nausea lumbar_pain  ... burning inflammation nephritis
0    35.5     no         yes  ...      no           no        no
1    35.9     no          no  ...     yes          yes        no
2    35.9     no         yes  ...      no           no        no
3    36.0     no          no  ...     yes          yes        no
4    36.0     no         yes  ...      no           no        no
5    36.0     no         yes  ...      no           no        no
6    36.2     no          no  ...     yes          yes        no
..    ...    ...         ...  ...     ...          ...       ...
113  41.2     no         yes  ...     yes           no       yes
114  41.3    yes         yes  ...      no          yes       yes
115  41.4     no         yes  ...     yes           no       yes
116  41.5     no          no  ...      no           no        no
117  41.5    yes         yes  ...      no           no       yes
118  41.5     no         yes  ...     yes           no       yes
119  41.5     no         yes  ...     yes           no       yes

[120 rows x 8 columns]

The goal is to predict two diseases of the urinary system: acute inflammation of the urinary bladder and acute nephritis. We therefore separate these two targets from the rest of the features and split the data into training and testing sets:

from interpretableai import iai
targets = ['inflammation', 'nephritis']
X = df.drop(targets, axis=1)
y = df[targets]
(train_X, train_y), (test_X, test_y) = iai.split_data('multi_classification',
                                                      X, y, seed=1)

Multi-task Optimal Classification Trees

We will use a GridSearch to fit an OptimalTreeMultiClassifier:

grid = iai.GridSearch(
    iai.OptimalTreeMultiClassifier(
        random_seed=1,
    ),
    max_depth=range(1, 6),
)
grid.fit(train_X, train_y)
grid.get_learner()
Optimal Trees Visualization
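
If you want to inspect the fitted learner outside of an interactive session, the grid search and learner expose the usual IAI utilities. The following is a sketch: get_best_params and write_html are standard IAI API calls that we assume apply to the multi-task learner, and the filename is arbitrary:

best_params = grid.get_best_params()   # hyperparameters selected by the grid search
lnr = grid.get_learner()               # the fitted multi-task tree
lnr.write_html('multi_task_oct.html')  # save an interactive visualization of the tree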

We can make predictions on new data using predict:

grid.predict(test_X)
{'inflammation': ['no', 'no', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no'], 'nephritis': ['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes']}

This returns a dictionary containing the predictions for each task, which can easily be converted to a DataFrame:

pd.DataFrame(grid.predict(test_X))
   inflammation nephritis
0            no        no
1            no        no
2           yes        no
3            no        no
4           yes        no
5            no        no
6           yes        no
..          ...       ...
29          yes       yes
30           no       yes
31           no       yes
32          yes       yes
33           no       yes
34          yes       yes
35           no       yes

[36 rows x 2 columns]

We can also generate the predictions for a specific task by passing the task label:

grid.predict(test_X, 'inflammation')
['no', 'no', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no']

We can evaluate the quality of the tree using score with any of the supported loss functions. For multi-task problems, the returned score is the average of the scores of the individual tasks:

grid.score(test_X, test_y, criterion='misclassification')
1.0

We can also calculate the score of a single task by specifying that task:

grid.score(test_X, test_y, 'nephritis', criterion='auc')
1.0
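
Since the overall score is the average of the per-task scores, we can verify it by scoring each task individually and averaging. This sketch only reuses the score calls shown above:

# Average the per-task misclassification scores across both targets
task_scores = [grid.score(test_X, test_y, task, criterion='misclassification')
               for task in ['inflammation', 'nephritis']]
sum(task_scores) / len(task_scores)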

The other standard API functions (e.g. predict_proba, ROCCurve) can be used as normal. As above, by default they generate output for all tasks, and a task label can be passed to return the information for a single task.
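
For example, we can compute predicted probabilities for all tasks at once or for a single task. This is a sketch: the task-label argument is assumed to follow the same pattern as predict and score above, and 'inflammation' is just one of the two task labels in this dataset:

# Predicted probabilities for every task (one entry per task, as with predict)
grid.predict_proba(test_X)

# Predicted probabilities for a single task
grid.predict_proba(test_X, 'inflammation')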

Extensions

The standard OCT extensions (e.g. hyperplane splits, logistic regression) are also available in the multi-task setting and controlled in the usual way.

For instance, we can use Optimal Classification Trees with hyperplane splits:

grid = iai.GridSearch(
    iai.OptimalTreeMultiClassifier(
        random_seed=1,
        max_depth=2,
        hyperplane_config={'sparsity': 'all'}
    ),
)
grid.fit(train_X, train_y)
grid.get_learner()
Optimal Trees Visualization
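
As before, we can evaluate the hyperplane-split tree on the test set, reusing the score call demonstrated earlier:

# Average misclassification score across both tasks for the hyperplane tree
grid.score(test_X, test_y, criterion='misclassification')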