Quick Start Guide: Multi-task Optimal Classification Trees

This is an R version of the corresponding OptimalTrees quick start guide.

In this example we will use Optimal Classification Trees (OCT) on the acute inflammations dataset to solve a multi-task classification problem.

This guide assumes you are familiar with OCTs and focuses on aspects that are unique to the multi-task setting. For a general introduction to OCTs, please refer to the OCT quick start guide.

First we load the data:

df <- read.table("diagnosis.data",
    sep = "\t",
    fileEncoding = "UTF-16LE",
    dec = ",",
    col.names = c("temp", "nausea", "lumbar_pain", "urine_pushing",
                  "micturition_pains", "burning", "inflammation", "nephritis"),
    stringsAsFactors = TRUE
)

  temp nausea lumbar_pain urine_pushing micturition_pains burning inflammation
1 35.5     no         yes            no                no      no           no
2 35.9     no          no           yes               yes     yes          yes
3 35.9     no         yes            no                no      no           no
4 36.0     no          no           yes               yes     yes          yes
5 36.0     no         yes            no                no      no           no
6 36.0     no         yes            no                no      no           no
7 36.2     no          no           yes               yes     yes          yes
  nephritis
1        no
2        no
3        no
4        no
5        no
6        no
7        no
 [ reached 'max' / getOption("max.print") -- omitted 113 rows ]
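
The `sep` and `dec` arguments above matter because the file is tab-separated and uses a comma as the decimal mark. As a minimal sketch of that behaviour, the snippet below parses two made-up rows from an in-memory string rather than from `diagnosis.data`; the string `raw` and the data frame `toy` are hypothetical names, and `fileEncoding` is omitted because no file is read.

```r
# Two made-up rows in the same layout: tab-separated, comma as decimal mark
raw <- "35,5\tno\n40,2\tyes"

toy <- read.table(
    text = raw,
    sep = "\t",
    dec = ",",
    col.names = c("temp", "nausea"),
    stringsAsFactors = TRUE
)

# temp is parsed as numeric and nausea as a factor
str(toy)
```

Without `dec = ","`, values such as `35,5` would not be read as numbers.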

The goal is to predict two diseases of the urinary system: acute inflammation of the urinary bladder and acute nephritis. We therefore separate these two targets from the features and split the data into training and test sets:

targets <- c("inflammation", "nephritis")
X <- df[, !names(df) %in% targets]
y <- df[targets]
split <- iai::split_data("multi_classification", X, y, seed = 1)
train_X <- split$train$X
train_y <- split$train$y
test_X <- split$test$X
test_y <- split$test$y
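
The column selection used above can be checked on a small example. The sketch below applies the same indexing to a made-up data frame (`toy`, `toy_X` and `toy_y` are hypothetical names and the values are not taken from the dataset), showing that the features keep every non-target column while the targets stay together as a data frame with one column per task.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
    temp = c(35.5, 37.9, 40.1, 36.6),
    nausea = factor(c("no", "no", "yes", "no")),
    inflammation = factor(c("no", "yes", "no", "yes")),
    nephritis = factor(c("no", "no", "yes", "no"))
)
targets <- c("inflammation", "nephritis")

# Features: every column whose name is not a target
toy_X <- toy[, !names(toy) %in% targets]
# Targets: one column per task, kept together as a data frame
toy_y <- toy[targets]

names(toy_X)
names(toy_y)
```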

Multi-task Optimal Classification Trees

We will use a grid_search to fit an optimal_tree_multi_classifier:

grid <- iai::grid_search(
    iai::optimal_tree_multi_classifier(
        random_seed = 1
    ),
    max_depth = 1:5
)
iai::fit(grid, train_X, train_y)
iai::get_learner(grid)

[Output: interactive Optimal Trees visualization of the fitted tree]

We can make predictions on new data using predict:

iai::predict(grid, test_X)

$inflammation
 [1] "no"  "no"  "yes" "no"  "yes" "no"  "yes" "yes" "no"  "yes" "yes" "no"
[13] "yes" "yes" "no"  "yes" "yes" "yes" "yes" "no"  "no"  "no"  "no"  "yes"
[25] "no"  "yes" "no"  "no"  "yes" "yes" "no"  "no"  "yes" "no"  "yes" "no"

$nephritis
 [1] "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"  "no"
[13] "no"  "no"  "no"  "no"  "no"  "no"  "no"  "yes" "yes" "no"  "yes" "yes"
[25] "yes" "yes" "no"  "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes"

This returns a list containing the predictions for each task, which can easily be converted to a data frame:

as.data.frame(iai::predict(grid, test_X))

   inflammation nephritis
1            no        no
2            no        no
3           yes        no
4            no        no
5           yes        no
6            no        no
7           yes        no
8           yes        no
9            no        no
10          yes        no
11          yes        no
12           no        no
13          yes        no
14          yes        no
15           no        no
16          yes        no
17          yes        no
18          yes        no
19          yes        no
20           no       yes
21           no       yes
22           no        no
23           no       yes
24          yes       yes
25           no       yes
26          yes       yes
27           no        no
28           no       yes
29          yes       yes
30          yes       yes
 [ reached 'max' / getOption("max.print") -- omitted 6 rows ]

We can also generate the predictions for a specific task by passing the task label:

iai::predict(grid, test_X, "inflammation")

 [1] "no"  "no"  "yes" "no"  "yes" "no"  "yes" "yes" "no"  "yes" "yes" "no"
[13] "yes" "yes" "no"  "yes" "yes" "yes" "yes" "no"  "no"  "no"  "no"  "yes"
[25] "no"  "yes" "no"  "no"  "yes" "yes" "no"  "no"  "yes" "no"  "yes" "no"
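
A single-task prediction vector like this can be compared directly with the true labels for that task. The following is a minimal base R sketch using made-up vectors (`pred_inflammation` and `true_inflammation` are hypothetical); with the objects from this guide, the inputs would instead be `iai::predict(grid, test_X, "inflammation")` and `test_y$inflammation`.

```r
# Made-up predictions and true labels for a single task
pred_inflammation <- c("no", "yes", "yes", "no", "yes")
true_inflammation <- c("no", "yes", "no", "no", "yes")

# Cross-tabulate: rows are predicted labels, columns are true labels
confusion <- table(predicted = pred_inflammation, actual = true_inflammation)
confusion
```

Off-diagonal cells count the observations where the prediction and the true label disagree.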

We can evaluate the quality of the tree using score with any of the supported loss functions. For multi-task problems, the returned score is the average of the scores of the individual tasks:

iai::score(grid, test_X, test_y, criterion = "misclassification")

[1] 1
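
To make the averaging concrete, the sketch below computes a score for each task by hand on a small set of made-up labels and then averages the two. It is only an illustration in base R: the toy lists and the helper `task_accuracy` are hypothetical, and it assumes accuracy as the per-task metric rather than reproducing the internals of `iai::score`.

```r
# Made-up true labels and predictions for two tasks (not from the dataset)
truth <- list(
    inflammation = c("no", "yes", "yes", "no"),
    nephritis    = c("no", "no", "yes", "yes")
)
preds <- list(
    inflammation = c("no", "yes", "no", "no"),
    nephritis    = c("no", "no", "yes", "yes")
)

# Hypothetical helper: fraction of labels predicted correctly for one task
task_accuracy <- function(task) mean(preds[[task]] == truth[[task]])

# Per-task scores, then their unweighted average across tasks
per_task <- sapply(names(truth), task_accuracy)
overall <- mean(per_task)

per_task
overall
```

Here one of the four inflammation labels is wrong and all nephritis labels are right, so the per-task scores are 0.75 and 1 and their average is 0.875.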

We can also calculate the score of a single task by specifying this task:

iai::score(grid, test_X, test_y, "nephritis", criterion = "auc")

[1] 1

The other standard API functions (e.g. predict_proba, roc_curve) can be called as normal. As above, by default they will generate output for all tasks, and a task can be specified to return information for a single task.

Extensions

The standard OCT extensions (e.g. hyperplane splits, logistic regression) are also available in the multi-task setting and controlled in the usual way.

For instance, we can use Optimal Classification Trees with hyperplane splits:

grid <- iai::grid_search(
    iai::optimal_tree_multi_classifier(
        random_seed = 1,
        max_depth = 2,
        hyperplane_config = list(sparsity = "all")
    )
)
iai::fit(grid, train_X, train_y)
iai::get_learner(grid)

[Output: interactive Optimal Trees visualization of the fitted tree with hyperplane splits]