Quick Start Guide: Multi-task Optimal Classification Trees
This is a Python version of the corresponding OptimalTrees quick start guide.
In this example we will use Optimal Classification Trees (OCT) on the acute inflammations dataset to solve a multi-task classification problem.
This guide assumes you are familiar with OCTs and focuses on aspects that are unique to the multi-task setting. For a general introduction to OCTs, please refer to the OCT quickstart guide.
First we load in the data and split into training and test datasets:
import pandas as pd
df = pd.read_csv('diagnosis.data',
                 encoding='UTF-16',
                 sep="\t",
                 decimal=',',
                 names=['temp', 'nausea', 'lumbar_pain', 'urine_pushing',
                        'micturition_pains', 'burning', 'inflammation',
                        'nephritis'],
                 )
for col in df.columns:
    if df.dtypes[col] == 'object':
        df[col] = df[col].astype('category')
temp nausea lumbar_pain ... burning inflammation nephritis
0 35.5 no yes ... no no no
1 35.9 no no ... yes yes no
2 35.9 no yes ... no no no
3 36.0 no no ... yes yes no
4 36.0 no yes ... no no no
5 36.0 no yes ... no no no
6 36.2 no no ... yes yes no
.. ... ... ... ... ... ... ...
113 41.2 no yes ... yes no yes
114 41.3 yes yes ... no yes yes
115 41.4 no yes ... yes no yes
116 41.5 no no ... no no no
117 41.5 yes yes ... no no yes
118 41.5 no yes ... yes no yes
119 41.5 no yes ... yes no yes
[120 rows x 8 columns]
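The object-to-category conversion in the loop above can be checked on a toy frame (hypothetical two-row data, same pattern as the loop):

```python
import pandas as pd

# Toy frame with one numeric and one string column (made-up values)
toy = pd.DataFrame({'temp': [35.5, 41.5], 'nausea': ['no', 'yes']})
for col in toy.columns:
    if toy.dtypes[col] == 'object':
        toy[col] = toy[col].astype('category')

# 'nausea' is now categorical; 'temp' keeps its float dtype
```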
The goal is to predict two diseases of the urinary system: acute inflammation of the urinary bladder and acute nephritis. We therefore separate these two targets from the rest of the features, and split the data into training and testing sets:
from interpretableai import iai
targets = ['inflammation', 'nephritis']
X = df.drop(targets, axis=1)
y = df[targets]
(train_X, train_y), (test_X, test_y) = iai.split_data('multi_classification',
                                                      X, y, seed=1)
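The mechanics of such a split can be sketched without the library: a seeded shuffle of the row indices followed by a cut. This is only an illustration (`split_indices` is a hypothetical helper, and `iai.split_data` additionally stratifies by the targets, which this sketch skips); the 70/30 fraction matches the 84/36 row counts seen later in this guide:

```python
import random

def split_indices(n, train_frac=0.7, seed=1):
    # Seeded shuffle split: deterministic for a given seed.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

train_idx, test_idx = split_indices(120)
# 84 training rows and 36 test rows, covering all 120 indices
```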
Multi-task Optimal Classification Trees
We will use a GridSearch to fit an OptimalTreeMultiClassifier:
grid = iai.GridSearch(
    iai.OptimalTreeMultiClassifier(
        random_seed=1,
    ),
    max_depth=range(1, 6),
)
grid.fit(train_X, train_y)
grid.get_learner()
We can make predictions on new data using predict:
grid.predict(test_X)
{'inflammation': ['no', 'no', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no'], 'nephritis': ['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes']}
This returns a dictionary containing the predictions for each task, which can easily be converted to a dataframe:
pd.DataFrame(grid.predict(test_X))
inflammation nephritis
0 no no
1 no no
2 yes no
3 no no
4 yes no
5 no no
6 yes no
.. ... ...
29 yes yes
30 no yes
31 no yes
32 yes yes
33 no yes
34 yes yes
35 no yes
[36 rows x 2 columns]
We can also generate the predictions for a specific task by passing the task label:
grid.predict(test_X, 'inflammation')
['no', 'no', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no']
We can evaluate the quality of the tree using score with any of the supported loss functions. For multi-task problems, the returned score is the average of the scores of the individual tasks:
grid.score(test_X, test_y, criterion='misclassification')
1.0
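The averaging behind a multi-task score can be illustrated in isolation. The per-task values below are made up for the example; they are not output from this dataset:

```python
# Hypothetical per-task scores (e.g. misclassification accuracy per task).
task_scores = {'inflammation': 1.0, 'nephritis': 0.95}

# A multi-task score averages the individual task scores.
overall = sum(task_scores.values()) / len(task_scores)
# overall == 0.975
```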
We can also calculate the score of a single task by specifying that task:
grid.score(test_X, test_y, 'nephritis', criterion='auc')
1.0
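The 'auc' criterion used above can be illustrated with a standalone computation. This is a pure-Python sketch of the rank-statistic definition of AUC for a binary task, not the library's implementation:

```python
def auc(labels, scores):
    """AUC as the probability a random positive outscores a random negative.

    labels: iterable of 0/1; scores: predicted probability of class 1.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Example: one positive is outscored by one negative -> 3 of 4 pairs win
auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```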
The other standard API functions (e.g. predict_proba, ROCCurve) can be called as normal. As above, by default they will generate output for all tasks, and a task can be specified to return information for a single task.
Extensions
The standard OCT extensions (e.g. hyperplane splits, logistic regression) are also available in the multi-task setting and controlled in the usual way.
For instance, we can use Optimal Classification Trees with hyperplane splits:
grid = iai.GridSearch(
    iai.OptimalTreeMultiClassifier(
        random_seed=1,
        max_depth=2,
        hyperplane_config={'sparsity': 'all'}
    ),
)
grid.fit(train_X, train_y)
grid.get_learner()