# Data Preparation Guide

All Interpretable AI methods handle data in the same format: a matrix of features X and one or more target vectors y. You can also specify the weight given to each point when fitting models and evaluating performance via the sample_weight keyword argument.

## Features

The feature matrix X can be supplied in two ways:

• a Matrix containing all numeric entries, or
• a DataFrame (see DataFrames.jl)

We recommend supplying the features as a DataFrame, since it supports additional functionality such as categorical data and missing values. If you are familiar with data frames in R or pandas in Python, DataFrames in Julia will feel very familiar.
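
If your data is already in Julia, you can also construct a DataFrame directly. A minimal sketch using DataFrames.jl (the column values here are just for illustration):

```julia
using DataFrames

# Build a small feature matrix as a DataFrame with named columns
df = DataFrame(SepalLength=[5.1, 4.9, 4.7],
               Species=["setosa", "setosa", "setosa"])

size(df)  # (3, 2)
```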

You can read in CSV files using the CSV.jl package, which automatically returns a DataFrame:

using CSV, DataFrames
df = CSV.read("iris.csv")
150×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species   │
│     │ Float64     │ Float64    │ Float64     │ Float64    │ String    │
├─────┼─────────────┼────────────┼─────────────┼────────────┼───────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa    │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa    │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa    │
│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa    │
│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa    │
│ 6   │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ setosa    │
│ 7   │ 4.6         │ 3.4        │ 1.4         │ 0.3        │ setosa    │
⋮
│ 143 │ 5.8         │ 2.7        │ 5.1         │ 1.9        │ virginica │
│ 144 │ 6.8         │ 3.2        │ 5.9         │ 2.3        │ virginica │
│ 145 │ 6.7         │ 3.3        │ 5.7         │ 2.5        │ virginica │
│ 146 │ 6.7         │ 3.0        │ 5.2         │ 2.3        │ virginica │
│ 147 │ 6.3         │ 2.5        │ 5.0         │ 1.9        │ virginica │
│ 148 │ 6.5         │ 3.0        │ 5.2         │ 2.0        │ virginica │
│ 149 │ 6.2         │ 3.4        │ 5.4         │ 2.3        │ virginica │
│ 150 │ 5.9         │ 3.0        │ 5.1         │ 1.8        │ virginica │

### Numerical features

A column of all numeric variables will be treated as numeric by default.
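
For example, a column with a numeric element type needs no extra annotation (a minimal Base-Julia sketch):

```julia
# A vector of Float64 values is treated as a numeric feature as-is
x = [5.1, 4.9, 4.7]

eltype(x)  # Float64, so this column is numeric
```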

### Categorical features

Categorical features are represented using a CategoricalVector (see CategoricalArrays.jl). There are a number of ways to mark features in the data as categorical:

• Construct the column with CategoricalVector:

df.Species = CategoricalArray(df.Species)
• Convert a column of a DataFrame into a categorical feature:

categorical!(df, :Species)
• Automatically convert any string columns to categorical when reading the data:

df = CSV.read("iris.csv", categorical=true, copycols=true)

### Ordinal features

Ordinal features are categorical features whose possible values have a natural order. These features are also represented using a CategoricalVector, except that the levels are marked as ordered when the vector is constructed:

df.Species = CategoricalArray(df.Species, ordered=true)

By default, the levels of the feature are kept in the standard sort order. You can inspect the current order with the levels function:

levels(df.Species)
3-element Array{String,1}:
"setosa"
"versicolor"
"virginica"

You can change the ordering with levels!:

levels!(df.Species, ["setosa", "virginica", "versicolor"])
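
Once the levels are marked as ordered, values can be compared with the usual operators. A minimal sketch using CategoricalArrays.jl, with a hypothetical grade feature:

```julia
using CategoricalArrays

# Construct an ordered vector with an explicit level order: C < B < A
grade = categorical(["B", "A", "C"], ordered=true, levels=["C", "B", "A"])

grade[3] < grade[1]  # true, since "C" is below "B" in the level order
```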

### Mixed features

Interpretable AI algorithms also support features that contain both numeric/ordinal and categoric values, which we refer to as "mixed" features. For instance, the possible values for an "age" feature might be:

• the age in years as a numeric value
• "Did not provide", indicating the person did not share their age
• missing, indicating the age was not recorded

Note that "Did not provide" and missing are distinct reasons for having no numeric value, and there can be value in handling these separately.

df.Age = rand([1, 2, 3, "Did not provide", missing], nrow(df))
150-element Array{Any,1}:
3
2
3
"Did not provide"
3
"Did not provide"
missing
missing
1
3
⋮
1
missing
2
3
1
3
2
missing
2

Notice that the element type of this vector is Any. If we pass in this data with no extra information, it is not clear to the algorithm how the column should be handled. Instead, we represent such features by constructing a special vector with the make_mixed_data function on the relevant column of the DataFrame:

df.Age = IAI.make_mixed_data(df.Age)
150-element Array{MixedDatum{Float64},1}:
3.0
2.0
3.0
"Did not provide"
3.0
"Did not provide"
missing
missing
1.0
3.0
⋮
1.0
missing
2.0
3.0
1.0
3.0
2.0
missing
2.0

We can undo this transformation and recover the original column if needed with the undo_mixed_data function.

Similarly, a column in the data might mix categoric and ordinal values. For example, a feature might consist of a letter grade (D through A) while also permitting "Not graded" and missing:

df.Grade = rand(["A", "B", "C", "D", "Not graded", missing], nrow(df))
150-element Array{Union{Missing, String},1}:
"C"
"B"
missing
"C"
"D"
"C"
"D"
"A"
⋮
missing
"A"
"A"
"C"
"D"
"B"
"Not graded"

As before, we use make_mixed_data to signal that the algorithm should treat this column as a mix of categoric and ordinal values. This time, we also have to specify the levels of the ordinal values in increasing order:

df.Grade = IAI.make_mixed_data(df.Grade, ["D", "C", "B", "A"])
150-element Array{MixedDatum{CategoricalValue{Any,UInt32}},1}:
"C"
"B"
missing
"C"
"D"
"C"
"D"
"A"
⋮
missing
"A"
"A"
"C"
"D"
"B"
"Not graded"

### Missingness

Julia represents missing values with the value missing. It is possible to set values to missing directly in a DataFrame. However, if the type of the column in the DataFrame does not support missing values, you will see the following error:

df[1:5, :SepalLength] .= missing
ERROR: MethodError: Cannot convert an object of type Missing to an object of type Float64

In this case, you can call allowmissing! first to permit missing values in the DataFrame, and then set the values to missing:

allowmissing!(df, :SepalLength)
df[1:5, :SepalLength] .= missing
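
The same principle applies to plain vectors: assigning missing is only possible when the element type includes Missing. A minimal Base-Julia sketch:

```julia
v = [5.1, 4.9, 4.7]                             # eltype Float64, cannot hold missing
w = convert(Vector{Union{Missing,Float64}}, v)  # widen the element type
w[1] = missing                                  # now allowed
```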

You can also use the missingstrings keyword argument to CSV.read to specify the strings that denote missing values:

df = CSV.read("iris.csv", missingstrings=["setosa"])

## Target

We categorize problems into different "tasks" based on the number and type of the target arguments.

### Classification

The target for classification problems is a single vector y containing the label for each point.

By default, the labels will be displayed in all outputs using the normal sorting order for the values. If you would like to specify the order in which to display the labels, you can pass in y as an ordered CategoricalVector as described in Ordinal features, and the labels will be displayed according to the level order of y.
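
For example, to display a hypothetical "no"/"yes" label with "no" first, y can be constructed as an ordered CategoricalVector (a sketch using CategoricalArrays.jl):

```julia
using CategoricalArrays

# Levels listed in the desired display order
y = categorical(["yes", "no", "yes"], ordered=true, levels=["no", "yes"])

levels(y)  # ["no", "yes"]
```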

### Regression

The target for regression problems is a single numeric vector y containing the value for each point.

### Prescription

The target for prescription problems is made up of two vectors:

• a vector treatments containing the treatment labels for each point (same format as a classification target)
• a vector outcomes containing the numeric outcomes for each point (same format as a regression target)

### Survival

The target for survival problems is made up of two vectors:

• a boolean vector deaths indicating whether each point corresponds to a death (true) or a censored observation (false)
• a numeric vector times containing the observation time for each point

### Imputation

There is no target for imputation problems as the problem is unsupervised.

## Sample Weights

Many of the IAI functions accept a sample_weight keyword argument that allows you to specify different weights for each point in the data when computing the scoring criterion. There are multiple ways to use this argument to specify the weights:

• nothing (default) will assign equal weight to all points
• supply a Vector or StatsBase.Weights containing the weights to use for each point

For classification and prescription problems, the following options are also available to specify weights based on the label of each point (the class for classification and the treatment for prescription):

• supply a Dict indicating the weight to assign to each label, e.g. to give points with label "A" twice the weight of points with label "B", we can use Dict("A" => 2, "B" => 1)

• :autobalance will choose weights that balance the label distribution, so that the total weight for each label is the same
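
To illustrate what :autobalance does conceptually, here is a Base-Julia sketch that computes weights making the total weight of each label equal (this mirrors the behavior, not necessarily IAI's internal implementation):

```julia
y = ["A", "A", "A", "B"]

# Count the occurrences of each label
counts = Dict(label => count(==(label), y) for label in unique(y))

# Weight each point inversely to its label frequency
w = [length(y) / (length(counts) * counts[label]) for label in y]

sum(w[y .== "A"]), sum(w[y .== "B"])  # both label totals are equal
```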

For information on when you might want to use sample weights, see tips and tricks.

## Data Splitting

We can split the data into training and testing sets with the split_data function by specifying the task type, the features, and the targets:

X = df[:, 1:4]
y = df[:, 5]
(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y)
length(train_y), length(test_y)
(105, 45)

The task type controls how the data is split:

• a stratified split on the class labels for :classification
• a stratified split on the treatments for :prescription
• a random split, otherwise
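
To illustrate what a stratified split means, here is a Base-Julia sketch that splits indices label-by-label so each class appears in the same proportion in both parts (a conceptual illustration, not IAI's implementation):

```julia
using Random

function stratified_split(y; train_proportion=0.7, rng=MersenneTwister(1))
    train, test = Int[], Int[]
    for label in unique(y)
        # Shuffle the indices of this label, then split them proportionally
        idx = shuffle(rng, findall(==(label), y))
        n_train = round(Int, train_proportion * length(idx))
        append!(train, idx[1:n_train])
        append!(test, idx[(n_train + 1):end])
    end
    train, test
end

y = repeat(["setosa", "versicolor", "virginica"], inner=50)
train, test = stratified_split(y)
length(train), length(test)  # (105, 45)
```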

We can optionally control the relative size of the two parts with the train_proportion keyword:

(train_X, train_y), (test_X, test_y) = IAI.split_data(:classification, X, y,
train_proportion=0.5)
length(train_y), length(test_y)
(75, 75)