Data Preparation Guide
All Interpretable AI methods handle data in the same format: a matrix of features `X` and one or more targets `y`. It is also possible to specify the weight given to each point while fitting models and evaluating performance with `sample_weight`.
Features
The feature matrix `X` can be supplied in two ways:

- a `Matrix` containing all numeric entries, or
- a `DataFrame` (see DataFrames.jl)
We recommend supplying the features as a `DataFrame`, since this supports additional functionality such as categorical data and missing values. If you are familiar with data frames in R or pandas in Python, then DataFrames in Julia will feel very familiar.

You can read in CSV files using the CSV.jl package and convert the result to a `DataFrame`:
using CSV, DataFrames
df = CSV.read("iris.csv", DataFrame)
150×5 DataFrame
Row │ SepalLength SepalWidth PetalLength PetalWidth Species
│ Float64 Float64 Float64 Float64 String
─────┼─────────────────────────────────────────────────────────────
1 │ 5.1 3.5 1.4 0.2 setosa
2 │ 4.9 3.0 1.4 0.2 setosa
3 │ 4.7 3.2 1.3 0.2 setosa
4 │ 4.6 3.1 1.5 0.2 setosa
5 │ 5.0 3.6 1.4 0.2 setosa
6 │ 5.4 3.9 1.7 0.4 setosa
7 │ 4.6 3.4 1.4 0.3 setosa
8 │ 5.0 3.4 1.5 0.2 setosa
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
144 │ 6.8 3.2 5.9 2.3 virginica
145 │ 6.7 3.3 5.7 2.5 virginica
146 │ 6.7 3.0 5.2 2.3 virginica
147 │ 6.3 2.5 5.0 1.9 virginica
148 │ 6.5 3.0 5.2 2.0 virginica
149 │ 6.2 3.4 5.4 2.3 virginica
150 │ 5.9 3.0 5.1 1.8 virginica
135 rows omitted
Numerical features
A column containing all numeric values will be treated as a numeric feature by default.
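For example, you can check how a column will be treated by inspecting its element type (this uses only base Julia functions on the iris data loaded above):

```julia
# A numeric column needs no special treatment
eltype(df.SepalLength) <: Real  # true for the Float64 columns above
```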
Categorical features
Categorical features are represented using either a `CategoricalVector` (see CategoricalArrays.jl) or a `PooledVector` (see PooledArrays.jl). There are a number of ways to mark features in the data as categorical:

Construct the column with `CategoricalVector`:

using CategoricalArrays

# In the case where the column name is known
df.Species = CategoricalArray(df.Species)

# In the case where the column name is a variable
colname = :Species
df[!, colname] = CategoricalArray(df[!, colname])
Convert a column of a `DataFrame` into a categorical feature:

transform!(df, :Species => categorical, renamecols=false)
Automatically convert any string columns to categorical when reading the data:

df = CSV.read("iris.csv", DataFrame, pool=true)

Note that by default, CSV.jl pools `String` columns with fewer than 10% unique values, so string columns with a small number of unique values will be treated as categorical. You can override this behavior by specifying a value for `pool`.
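For instance, the `pool` keyword accepts either a `Bool` or a fractional threshold, so the default behavior can be overridden in either direction:

```julia
using CSV, DataFrames

# Never pool: keep all string columns as plain String vectors
df_plain = CSV.read("iris.csv", DataFrame, pool=false)

# Pool any string column with up to 50% unique values
df_pooled = CSV.read("iris.csv", DataFrame, pool=0.5)
```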
Ordinal features
Ordinal features are categorical features whose potential values are ordered. These features are also represented using a `CategoricalVector`, except the levels in the vector are marked as ordered when constructing the vector:

df.Species = CategoricalArray(df.Species, ordered=true)

By default, the order of the levels in the feature will be the standard sort order. You can inspect the current order with `levels`:
levels(df.Species)
3-element Vector{String}:
"setosa"
"versicolor"
"virginica"
You can change the ordering with `levels!`:
levels!(df.Species, ["setosa", "virginica", "versicolor"])
Mixed features
Interpretable AI algorithms also support an additional type of feature that contains both numeric/ordinal and categoric values, which we refer to as a "mixed" feature. For instance, the possible values for an "age" feature might be:

- the age in years as a numeric value
- "Did not provide", indicating the person did not share their age
- `missing`, indicating the age was not recorded

Note that "Did not provide" and `missing` are distinct reasons for having no numeric value, and there can be value in handling them separately.
df.Age = repeat([1, 2, 3, "Did not provide", missing], Int(size(df, 1) / 5))
150-element Vector{Any}:
1
2
3
"Did not provide"
missing
1
2
3
"Did not provide"
missing
⋮
2
3
"Did not provide"
missing
1
2
3
"Did not provide"
missing
Notice that the type of this vector is `Any`. If we simply pass in this data with no extra information, it is not clear to the algorithm how it should be handled. Instead, we represent these features by constructing a special vector using the `make_mixed_data` function on the relevant column in the `DataFrame`:
df.Age = IAI.make_mixed_data(df.Age)
150-element Vector{Union{Missing, NumericMixedDatum}}:
1.0
2.0
3.0
"Did not provide"
missing
1.0
2.0
3.0
"Did not provide"
missing
⋮
2.0
3.0
"Did not provide"
missing
1.0
2.0
3.0
"Did not provide"
missing
We can undo this transformation and recover the original column if needed with the `undo_mixed_data` function.
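A sketch of reversing the conversion (assuming `undo_mixed_data` simply takes the converted vector and returns the original values):

```julia
# Convert the mixed column back to its original form
# of numbers, strings, and missings
df.Age = IAI.undo_mixed_data(df.Age)
```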
Similarly, we might have a column in the data that is a mix of categoric and ordinal values. For example, a feature might consist of a letter grade (C through A), as well as permitting "Not graded" and `missing`:
df.Grade = repeat(["A", "B", "C", "Not graded", missing], Int(nrow(df) / 5))
150-element Vector{Union{Missing, String}}:
"A"
"B"
"C"
"Not graded"
missing
"A"
"B"
"C"
"Not graded"
missing
⋮
"B"
"C"
"Not graded"
missing
"A"
"B"
"C"
"Not graded"
missing
As before, we use `make_mixed_data` to signal to the algorithm that this column should be treated as a mix of categoric and ordinal values. This time, we also have to specify the levels of the ordinal values in increasing order:
df.Grade = IAI.make_mixed_data(df.Grade, ["C", "B", "A"])
150-element Vector{Union{Missing, OrdinalMixedDatum{String}}}:
"A"
"B"
"C"
"Not graded"
missing
"A"
"B"
"C"
"Not graded"
missing
⋮
"B"
"C"
"Not graded"
missing
"A"
"B"
"C"
"Not graded"
missing
Missingness
Julia represents missing values with the value `missing`. It is possible to set values to `missing` directly in a `DataFrame`. However, if the type of the column in the `DataFrame` does not support missing values, you will see the following error:
df[1:5, :SepalLength] .= missing
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64
In this case, you can first call `allowmissing!` to permit missing values in the `DataFrame`, and then set the values to `missing`:
allowmissing!(df, :SepalLength)
df[1:5, :SepalLength] .= missing
You can also use the `missingstring` keyword argument to `CSV.read` to specify the strings that denote missing values:

df = CSV.read("iris.csv", DataFrame, missingstring=["setosa"])
Target
We categorize problems into different "tasks" based on the number and type of the target arguments.
Classification
The target for classification problems is a single vector `y` containing the label for each point.

By default, the labels will be displayed in all outputs using the normal sort order for the values. If you would like to specify the order in which to display the labels, you can pass in `y` as an ordered `CategoricalVector` as described in Ordinal features, and the labels will be displayed according to the level order of `y`.
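For example, to display the iris labels in a custom order, we can build `y` as an ordered `CategoricalVector` using the same tools shown in Ordinal features:

```julia
using CategoricalArrays

# Outputs will list virginica first, then versicolor, then setosa
y = CategoricalArray(df.Species, ordered=true)
levels!(y, ["virginica", "versicolor", "setosa"])
```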
Regression
The target for regression problems is a single numeric vector `y` containing the value for each point.
Prescription
The target for prescription problems is made up of two vectors:

- a vector `treatments` containing the treatment label for each point (same format as a classification target)
- a vector `outcomes` containing the numeric outcome for each point (same format as a regression target)
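As a sketch, a prescription target might look like the following (the treatment names and outcome values here are synthetic, purely for illustration):

```julia
n = 150

# Treatment label for each point, like a classification target
treatments = rand(["drug_A", "drug_B"], n)

# Numeric outcome observed for each point, like a regression target
outcomes = randn(n)
```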
Policy
The target for policy problems, `rewards`, is either a numeric `Matrix` or a `DataFrame` with all numeric columns. If using a `DataFrame`, the column names indicate the names of the treatment options.

In practice, these rewards will often be estimated from observational data in the same format as prescription problems (`treatments` and `outcomes`). This can be done using Reward Estimation; refer to the Optimal Policy Trees quickstart guide for an example.
Survival
Survival problems deal with predicting the distribution of occurrences of some event, for example:
- the survival time of patients following some treatment or intervention (the traditional survival analysis context - see the Optimal Survival Trees quickstart for an example)
- the time-to-failure of machines or other equipment (in the context of predictive maintenance - see the Turbofan case study for an example)
- the total value of claims paid out by an individual's insurance policy in a given year
There are two key differences to a regression problem:
- The focus is on understanding the distribution of outcomes, not simply making point estimates
- Survival problems can handle censored data, where the outcome value may not be observed for a given record, but a lower bound is known. For instance, if a patient visits their doctor at time $t$, then even if we do not know their eventual survival time, we do know they survived until at least time $t$.
The target for survival problems is made up of two vectors (whose names reflect the traditional survival analysis context):

- a boolean vector `deaths` indicating whether each point was a death (`true`) or a censored (`false`) observation
- a numeric vector `times` containing the observation time for each point
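A sketch of a survival target using synthetic values:

```julia
n = 150

# true for observed deaths, false for censored observations
deaths = rand(Bool, n)

# Observation time for each point: the survival time if a death,
# otherwise the time of censoring (a lower bound on survival)
times = 10 .* rand(n)
```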
Imputation
There is no target for imputation problems as the problem is unsupervised.
Multi-task Problems
For problems with multiple tasks (e.g. multiple classification targets), the target should be supplied in one of the following forms:

- a `DataFrame` with each column containing a separate target vector. The name of each column will be used as the label for the respective task.
- a matrix where each column is a separate target vector. The task labels will be automatically generated as `y1`, `y2`, `y3`, etc.
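For example, a two-task classification target might be supplied as a `DataFrame` (the task names and labels here are hypothetical):

```julia
using DataFrames

# Column names "risk" and "readmission" become the task labels
y = DataFrame(risk=rand(["high", "low"], 150),
              readmission=rand(["yes", "no"], 150))
```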
Sample Weights
Many of the IAI functions accept a `sample_weight` keyword argument allowing you to specify different weights in the scoring criterion for each point in the data. There are multiple ways you can use this argument to specify the weights:

- `nothing` (default) will assign equal weight to all points
- supply a `Vector` or `StatsBase.Weights` containing the weight to use for each point

For classification and prescription problems, the following options are also available to specify weights based on the label of each point (the class for classification and the treatment for prescription):

- supply a `Dict` indicating the weight to assign to each label, e.g. to assign twice the weight to points of type "A" over type "B", we can use `Dict("A" => 2, "B" => 1)`
- `:autobalance` will choose weights that balance the label distribution, so that the total weight for each label is the same
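For instance, using the iris labels, label-based weights might look like the following (either form would then be passed via the `sample_weight` keyword of the relevant fitting or scoring function):

```julia
# Double the weight of setosa points relative to the other classes
weights = Dict("setosa" => 2, "versicolor" => 1, "virginica" => 1)

# Or let the weights be chosen automatically to balance the classes
# by passing sample_weight=:autobalance instead of a Dict
```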
For information on when you might want to use sample weights, see tips and tricks.
Data Splitting
We can split the data into training and testing sets with the `split_data` function by specifying the task type, the features, and the targets:
X = df[:, 1:4]
y = df[:, 5]
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y)
length(train_y), length(test_y)
(105, 45)
The task type controls how the data is split:

- a stratified split on the class labels for `:classification`
- a stratified split on the treatments for `:prescription_maximize` or `:prescription_minimize`
- a random split, otherwise
We can optionally control the relative size of the two parts with the `train_proportion` keyword:
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
train_proportion=0.5)
length(train_y), length(test_y)
(75, 75)