Data Preparation Guide

All Interpretable AI methods handle data in the same format: a matrix of features X and one or more targets y. You can also use sample_weight to specify the weight given to each point when fitting models and evaluating performance.

Features

The feature matrix X can be supplied in two ways:

  • a Matrix containing all numeric entries, or
  • a DataFrame (see DataFrames.jl)

We recommend supplying the features as a DataFrame since this supports more functionality like categorical data and missing values. If you are familiar with dataframes in R and pandas in Python, then DataFrames in Julia will feel very familiar.
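
As a minimal sketch of the two input forms, the snippet below builds the same features first as a numeric Matrix and then as a DataFrame. The values and variable names are illustrative only, and the layout of one row per point is assumed.

```julia
using DataFrames

# Option 1: a plain numeric Matrix
# (assumed layout: one row per point, one column per feature)
X_matrix = [5.1 3.5;
            4.9 3.0;
            4.7 3.2]

# Option 2: the same data as a DataFrame with named columns
X_df = DataFrame(X_matrix, [:SepalLength, :SepalWidth])

# A DataFrame can also hold non-numeric and missing entries,
# which a purely numeric Matrix cannot
X_df.Species = ["setosa", missing, "setosa"]
```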

You can read in CSV files using the CSV.jl package and convert the result to a DataFrame:

using CSV, DataFrames
df = CSV.read("iris.csv", DataFrame)
150×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     String
─────┼─────────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa
   6 │         5.4         3.9          1.7         0.4  setosa
   7 │         4.6         3.4          1.4         0.3  setosa
   8 │         5.0         3.4          1.5         0.2  setosa
  ⋮  │      ⋮           ⋮            ⋮           ⋮           ⋮
 144 │         6.8         3.2          5.9         2.3  virginica
 145 │         6.7         3.3          5.7         2.5  virginica
 146 │         6.7         3.0          5.2         2.3  virginica
 147 │         6.3         2.5          5.0         1.9  virginica
 148 │         6.5         3.0          5.2         2.0  virginica
 149 │         6.2         3.4          5.4         2.3  virginica
 150 │         5.9         3.0          5.1         1.8  virginica
                                                   135 rows omitted

Numerical features

A column that contains only numeric values will be treated as a numeric feature by default.
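
One way to check which columns hold only numeric values is to select columns by element type. This is a rough sketch with a small illustrative DataFrame (df_example is a hypothetical name); using the element type as a proxy for "numeric" is an assumption, and columns that allow missing values have a Union element type that this particular selector will not match.

```julia
using DataFrames

# Small illustrative table with two numeric columns and one string column
df_example = DataFrame(SepalLength = [5.1, 4.9],
                       Count       = [1, 2],
                       Species     = ["setosa", "virginica"])

# Names of the columns whose element type is a subtype of Real
numeric_cols = names(df_example, Real)
```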

Categorical features

Categorical features are represented using either a CategoricalVector (see CategoricalArrays.jl) or a PooledVector (see PooledArrays.jl). There are a number of ways to mark features in the data as categorical:

  • Construct the column with CategoricalVector:

    using CategoricalArrays
    # In the case where the column name is known
    df.Species = CategoricalArray(df.Species)
    # In the case where the column name is a variable
    colname = :Species
    df[!, colname] = CategoricalArray(df[!, colname])
  • Convert a column of a DataFrame into a categorical feature:

    transform!(df, :Species => categorical, renamecols=false)
  • Automatically convert any string columns to categorical when reading the data:

    df = CSV.read("iris.csv", DataFrame, pool=true)

    Note that CSV.jl pools String columns with fewer than 10% unique values by default, so string columns with a small number of unique values will be treated as categorical. You can override this behavior by specifying a value for pool.

Ordinal features

Ordinal features are categorical features whose potential values are ordered. These features are also represented using a CategoricalVector, except the levels in the vector are marked as ordered when constructing the vector:

df.Species = CategoricalArray(df.Species, ordered=true)

By default the order of the levels in the feature will be the standard sort order. You can inspect the current order with levels:

levels(df.Species)
3-element Vector{String}:
 "setosa"
 "versicolor"
 "virginica"

You can change the ordering with levels!:

levels!(df.Species, ["setosa", "virginica", "versicolor"])
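
To confirm that a custom order has taken effect, you can check that the vector is ordered and compare two of its values. The sketch below uses a short illustrative vector rather than the full iris column.

```julia
using CategoricalArrays

# Illustrative data, not the full iris column
species = CategoricalArray(["setosa", "virginica", "versicolor", "setosa"],
                           ordered=true)

# Apply a custom level order, as in the text above
levels!(species, ["setosa", "virginica", "versicolor"])

# Confirm the vector is ordered and that comparisons follow the new order
is_ordered = isordered(species)
virginica_first = species[2] < species[3]   # "virginica" < "versicolor"
```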

Mixed features

Interpretable AI algorithms also support an additional type of feature that combines numeric or ordinal values with categorical values, which we refer to as a "mixed" feature. For instance, a feature "age" might take any of the following values:

  • the age in years as a numeric value
  • "Did not provide", indicating the person did not share their age
  • missing, indicating the age was not recorded

Note that "Did not provide" and missing are distinct reasons for having no numeric values, and there can be value in handling these separately.

df.Age = repeat([1, 2, 3, "Did not provide", missing], Int(size(df, 1) / 5))
150-element Vector{Any}:
 1
 2
 3
  "Did not provide"
  missing
 1
 2
 3
  "Did not provide"
  missing
 ⋮
 2
 3
  "Did not provide"
  missing
 1
 2
 3
  "Did not provide"
  missing

Notice that the type of this vector is Any. If we just pass in this data with no extra information, it is not clear to the algorithm how it should be handled. We represent these features by constructing a special vector using the make_mixed_data function on the relevant column in the DataFrame.

df.Age = IAI.make_mixed_data(df.Age)
150-element Vector{Union{Missing, NumericMixedDatum}}:
 1.0
 2.0
 3.0
 "Did not provide"
 missing
 1.0
 2.0
 3.0
 "Did not provide"
 missing
 ⋮
 2.0
 3.0
 "Did not provide"
 missing
 1.0
 2.0
 3.0
 "Did not provide"
 missing

We can undo this transformation and recover the original column if needed with the undo_mixed_data function.

Similarly, we might have a column in the data that is a mix of categorical and ordinal values. For example, a feature might consist of a letter grade (D through A), while also permitting "Not graded" and missing.

df.Grade = repeat(["A", "B", "C", "Not graded", missing], Int(nrow(df) / 5))
150-element Vector{Union{Missing, String}}:
 "A"
 "B"
 "C"
 "Not graded"
 missing
 "A"
 "B"
 "C"
 "Not graded"
 missing
 ⋮
 "B"
 "C"
 "Not graded"
 missing
 "A"
 "B"
 "C"
 "Not graded"
 missing

As before, we use make_mixed_data to signal to the algorithm to treat this column as a mix of categorical and ordinal values. This time we have to specify the levels of the ordinal values in increasing order.

df.Grade = IAI.make_mixed_data(df.Grade, ["C", "B", "A"])
150-element Vector{Union{Missing, OrdinalMixedDatum{String}}}:
 "A"
 "B"
 "C"
 "Not graded"
 missing
 "A"
 "B"
 "C"
 "Not graded"
 missing
 ⋮
 "B"
 "C"
 "Not graded"
 missing
 "A"
 "B"
 "C"
 "Not graded"
 missing

Missingness

Julia represents missing values with the value missing. It is possible to set values to missing directly in a DataFrame. However, if the type of the column in the DataFrame does not support missing values, you will see the following error:

df[1:5, :SepalLength] .= missing
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64

In this case, you can call allowmissing! first to permit missing values in the DataFrame, and then set the values to missing:

allowmissing!(df, :SepalLength)
df[1:5, :SepalLength] .= missing
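
A quick way to verify the result is to inspect the column's element type and count the missing entries. This sketch uses a small illustrative DataFrame (df_example is a hypothetical name) so that it runs on its own.

```julia
using DataFrames

df_example = DataFrame(SepalLength = [5.1, 4.9, 4.7, 4.6])

# Permit missing values in the column, then set the first two rows to missing
allowmissing!(df_example, :SepalLength)
df_example[1:2, :SepalLength] .= missing

# The element type now includes Missing, and two entries are missing
col_type = eltype(df_example.SepalLength)
n_missing = count(ismissing, df_example.SepalLength)
```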

You can also use the missingstring keyword argument to CSV.read to specify the strings that denote missing values:

df = CSV.read("iris.csv", DataFrame, missingstring=["setosa"])

Target

We categorize problems into different "tasks" based on the number and type of the target arguments.

Classification

The target for classification problems is a single vector y containing the label for each point.

By default, the labels will be displayed in all outputs using the normal sorting order for the values. If you would like to specify the order in which to display the labels, you can pass in y as an ordered CategoricalVector as described in Ordinal features, and the labels will be displayed according to the level order of y.

Regression

The target for regression problems is a single numeric vector y containing the value for each point.

Prescription

The target for prescription problems is made up of two vectors:

  • a vector treatments containing the treatment labels for each point (same format as a classification target)
  • a vector outcomes containing the numeric outcomes for each point (same format as a regression target)

Policy

The target for policy problems, rewards, is either a single numeric Matrix or a DataFrame with all numeric columns. If using a DataFrame, the column names indicate the names of the treatment options.

In practice, these rewards will often be estimated from observational data in the same format as prescription problems (treatments and outcomes). This can be done using RewardEstimation; refer to the Optimal Policy Trees quickstart guide for an example.

Survival

Survival problems deal with trying to predict the distribution of occurrences of some event, for example:

  • the survival time of patients following some treatment or intervention (the traditional survival analysis context - see the Optimal Survival Trees quickstart for an example)
  • the time-to-failure of machines or other equipment (in the context of predictive maintenance - see the Turbofan case study for an example)
  • the total value of claims paid out by an individual's insurance policy in a given year

There are two key differences from a regression problem:

  1. The focus is on understanding the distribution of outcomes, not simply making point estimates
  2. Survival problems can handle censored data, where the outcome value may not be observed for a given record, but a lower bound is known. For instance, if a patient visits their doctor at time $t$, then even if we do not know their eventual survival time, we do know they survived until at least time $t$.

The target for survival problems is made up of two vectors (whose names reflect the traditional survival analysis context):

  • a boolean vector deaths indicating whether each point was a death (true) or a censored (false) observation
  • a numeric vector times containing the observation time for each point
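
As a sketch of how these two vectors might be assembled from raw follow-up records, the example below assumes a hypothetical layout in which event_time holds the time of death when it was observed (or nothing otherwise) and last_followup holds the last time the patient was seen. Both names and all values are illustrative.

```julia
# Hypothetical follow-up records (names and values are illustrative):
# `event_time` is the time of death if observed, or `nothing` if the
# patient was still alive at `last_followup`
event_time    = [5.0, nothing, 12.0, nothing]
last_followup = [5.0, 8.0, 12.0, 20.0]

# true = death observed, false = censored
deaths = [!isnothing(t) for t in event_time]

# Observation time: the event time if observed, otherwise the last
# follow-up, which is a lower bound on the true survival time
times = [isnothing(t) ? f : t for (t, f) in zip(event_time, last_followup)]
```

For the censored records, times holds the lower bound described above rather than the unknown survival time.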

Imputation

There is no target for imputation problems as the problem is unsupervised.

Multi-task Problems

For problems with multiple tasks (e.g. multiple classification targets), the target should be supplied in one of the following forms:

  • A DataFrame with each column containing a separate target vector. The name of each column will be used as the label for the respective task.
  • A matrix where each column is a separate target vector. The task labels will be automatically generated as y1, y2, y3, etc.

Sample Weights

Many of the IAI functions accept a sample_weight keyword argument allowing you to specify different weights in the scoring criterion for each point in the data. There are multiple ways you can use this argument to specify the weights:

  • nothing (default) will assign equal weight to all points
  • supply a Vector or StatsBase.Weights containing the weights to use for each point

For classification and prescription problems, the following options are also available to specify weights based on the label of each point (the class for classification and the treatment for prescription):

  • supply a Dict indicating the weight to assign to each label, e.g. to assign twice the weight to points of type "A" over type "B", we can use Dict("A" => 2, "B" => 1)

  • :autobalance will choose weights that balance the label distribution, so that the total weight for each label is the same
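
To make the label-based options concrete, the sketch below expands a Dict of label weights into one weight per point, and then computes a set of balanced weights by hand. The balanced weights use 1 / (label count), which is only one choice that equalizes the total weight per label; the exact scaling that :autobalance applies is not specified here, so treat this as an illustration rather than the library's computation.

```julia
y = ["A", "A", "A", "B"]

# Per-label weights expanded to one weight per point
label_weights = Dict("A" => 2, "B" => 1)
pointwise = [label_weights[label] for label in y]

# Balanced weights (assumed scaling): weight each point by
# 1 / (number of points sharing its label)
counts = Dict{String,Int}()
for label in y
    counts[label] = get(counts, label, 0) + 1
end
balanced = [1 / counts[label] for label in y]

# Total weight per label under the balanced scheme is the same
total_A = sum(balanced[y .== "A"])
total_B = sum(balanced[y .== "B"])
```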

For information on when you might want to use sample weights, see tips and tricks.

Data Splitting

We can split the data into training and testing sets with the split_data function by specifying the task type, the features and the targets:

X = df[:, 1:4]
y = df[:, 5]
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y)
length(train_y), length(test_y)
(105, 45)

The task type controls how the data is split:

  • a stratified split on the class labels for :classification
  • a stratified split on the treatments for :prescription_maximize or :prescription_minimize
  • a random split, otherwise
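
To illustrate what a stratified split means, the sketch below splits row indices so that each label keeps roughly the same share in both parts. It is written in plain Julia and is not the implementation behind split_data; the function stratified_split and its 0.7 default proportion are hypothetical choices made for this example.

```julia
using Random

# Hypothetical helper: split indices so that each label contributes
# about `train_proportion` of its points to the training part
function stratified_split(y; train_proportion=0.7, rng=Random.default_rng())
    train_idx = Int[]
    test_idx = Int[]
    for label in unique(y)
        # Shuffle the indices of this label, then cut them in two
        idx = shuffle(rng, findall(==(label), y))
        n_train = round(Int, train_proportion * length(idx))
        append!(train_idx, idx[1:n_train])
        append!(test_idx, idx[n_train+1:end])
    end
    return sort(train_idx), sort(test_idx)
end

# Three labels with 50 points each, mirroring the shape of the iris data
y = repeat(["setosa", "versicolor", "virginica"], inner=50)
train_idx, test_idx = stratified_split(y; rng=MersenneTwister(1))
```

With 50 points per label, each label contributes 35 points to the training part and 15 to the testing part, whichever random seed is used.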

We can optionally control the relative size of the two parts with the train_proportion keyword:

(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
                                                      train_proportion=0.5)
length(train_y), length(test_y)
(75, 75)