Data Preparation Guide

Conceptually, the R interface to IAI accepts data in the same formats as described for Julia, except as R data structures. Below we discuss how to supply the various data elements from R.

Features

The feature matrix X can be supplied in two ways:

  • a matrix containing all numeric entries
  • a data.frame

We recommend supplying the features as a data.frame, since it supports additional feature types such as categorical data, as well as missing values.

You can read in CSV files as a data.frame with:

df <- read.table("iris.csv", sep = ",", header = TRUE)
   SepalLength SepalWidth PetalLength PetalWidth Species
1          5.1        3.5         1.4        0.2  setosa
2          4.9        3.0         1.4        0.2  setosa
3          4.7        3.2         1.3        0.2  setosa
4          4.6        3.1         1.5        0.2  setosa
5          5.0        3.6         1.4        0.2  setosa
6          5.4        3.9         1.7        0.4  setosa
7          4.6        3.4         1.4        0.3  setosa
8          5.0        3.4         1.5        0.2  setosa
9          4.4        2.9         1.4        0.2  setosa
10         4.9        3.1         1.5        0.1  setosa
11         5.4        3.7         1.5        0.2  setosa
12         4.8        3.4         1.6        0.2  setosa
 [ reached 'max' / getOption("max.print") -- omitted 138 rows ]

Numerical features

A column containing only numeric values will be treated as numeric by default.
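To see which columns will count as numeric, one option is to check each column's type before passing the data on. The sketch below uses base R only and a small made-up data frame (the name toy and its values are illustrative, not part of the iris data); a column that was read in as text is converted with as.numeric.

```r
# Small illustrative data frame; `width` was stored as text by mistake
toy <- data.frame(
  length = c(5.1, 4.9, 4.7),
  width = c("3.5", "3.0", "3.2"),
  stringsAsFactors = FALSE
)

# TRUE for each column that R stores as numeric
numeric_cols <- sapply(toy, is.numeric)

# Convert the text column so that it holds numeric values
toy$width <- as.numeric(toy$width)
numeric_cols_fixed <- sapply(toy, is.numeric)
```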

Categorical features

Categorical features are represented using a factor. For example, you can mark an existing column in the data.frame as categorical:

df$Species <- as.factor(df$Species)

Refer to the R documentation on factors for more information on working with categorical features.
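As a quick check after the conversion, the levels and counts of a factor can be inspected with base R. This is a minimal sketch on a standalone vector (species is an illustrative name) rather than on df itself:

```r
# Illustrative character vector converted to a factor
species <- c("setosa", "versicolor", "virginica", "setosa")
species_factor <- as.factor(species)

# The distinct categories, sorted by as.factor
species_levels <- levels(species_factor)

# Number of observations in each category
species_counts <- table(species_factor)
```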

Ordinal features

Ordinal features are represented using an ordered factor. For example, you can mark an existing column as ordered with:

df$Species <- factor(df$Species, ordered = TRUE,
                     levels = c("virginica", "versicolor", "setosa"))

Again, refer to the R documentation on factors for more information on working with ordinal features.
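The order given in levels is what makes the factor ordinal: base R then allows comparisons between values. A small sketch, using a made-up sizes vector rather than the iris data:

```r
# Illustrative ordinal data with an explicit ordering of the levels
sizes <- c("small", "large", "medium", "small")
sizes_ordered <- factor(sizes, ordered = TRUE,
                        levels = c("small", "medium", "large"))

# Comparisons follow the level order, not alphabetical order
below_large <- sizes_ordered < "large"  # TRUE FALSE TRUE TRUE

# The largest value according to the level order
largest <- as.character(max(sizes_ordered))
```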

Mixed features

You can mark a feature as an IAI mixed data type using as.mixeddata. For example, we can generate a column that mixes numeric and categorical values and mark it as mixed:

set.seed(1)
df$mixed <- rnorm(150)
df$mixed[1:5] <- NA  # Insert some missing values
df$mixed[6:10] <- "Not graded"
df$mixed <- iai::as.mixeddata(df$mixed, c("Not graded"))
   SepalLength SepalWidth PetalLength PetalWidth Species      mixed
1          5.1        3.5         1.4        0.2  setosa         NA
2          4.9        3.0         1.4        0.2  setosa         NA
3          4.7        3.2         1.3        0.2  setosa         NA
4          4.6        3.1         1.5        0.2  setosa         NA
5          5.0        3.6         1.4        0.2  setosa         NA
6          5.4        3.9         1.7        0.4  setosa Not graded
7          4.6        3.4         1.4        0.3  setosa Not graded
8          5.0        3.4         1.5        0.2  setosa Not graded
9          4.4        2.9         1.4        0.2  setosa Not graded
10         4.9        3.1         1.5        0.1  setosa Not graded
 [ reached 'max' / getOption("max.print") -- omitted 140 rows ]
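One detail of the snippet above is worth noting: assigning a string into a numeric vector makes R coerce the whole vector to character, so df$mixed is a character column by the time it is passed to as.mixeddata. The base R sketch below shows this coercion on a short standalone vector (values is an illustrative name):

```r
set.seed(1)
values <- rnorm(10)
class_before <- class(values)  # "numeric"

values[1:2] <- NA              # NA does not change the type
values[3:4] <- "Not graded"    # a string forces coercion to character
class_after <- class(values)   # "character"
```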

Missingness

Missing values are represented in a data.frame by the special value NA.

For example, the following code rewrites all values of "setosa" to be missing:

df$Species[df$Species == "setosa"] <- NA
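To check how much data is missing after a change like this, base R offers is.na and complete.cases. A minimal sketch on a made-up data frame (toy and its columns are illustrative):

```r
# Illustrative data frame with one missing numeric value
toy <- data.frame(
  length = c(5.1, NA, 4.7, 4.6),
  species = factor(c("setosa", "versicolor", "setosa", "virginica"))
)

# Mark every "setosa" entry as missing, as in the example above
toy$species[toy$species == "setosa"] <- NA

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

# TRUE for rows that have no missing values at all
complete_rows <- complete.cases(toy)
```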

Target

The format of the target data depends on the problem type as described for Julia. For each target vector, you should pass an atomic vector with elements of the appropriate type.
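A common way to end up with a non-atomic target is to select a column with single brackets, which returns a one-column data.frame rather than a vector. The sketch below, on a made-up data frame, shows the difference using base R's is.atomic:

```r
# Illustrative data with a label column
toy <- data.frame(x = c(1.2, 3.4, 5.6), label = c("A", "B", "A"))

y_vector <- toy$label    # atomic vector, suitable as a target
y_frame <- toy["label"]  # one-column data.frame, not an atomic vector

vector_is_atomic <- is.atomic(y_vector)  # TRUE
frame_is_atomic <- is.atomic(y_frame)    # FALSE
```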

Sample Weights

Many functions in the R interface accept a sample_weight keyword argument in the same way as the Julia modules. There are multiple ways you can use this argument to specify the weights:

  • NULL (default) will assign equal weight to all points
  • supply an atomic vector containing the weights to use for each point

For classification and prescription problems, you can also use:

  • a list indicating the weight to assign to each label, e.g. to assign twice the weight to points of type "A" over type "B", we can use:

      list(A = 2, B = 1)
  • "autobalance" to choose weights that balance the label distribution, so that the total weight for each label is the same
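To make the per-label options concrete, the sketch below computes equivalent per-point weight vectors by hand in base R. It only illustrates the idea: the labels vector is made up, and the scaling IAI applies internally for "autobalance" may differ from the simple 1/count rule assumed here.

```r
# Illustrative labels: three points of type "A", one of type "B"
labels <- c("A", "A", "A", "B")

# Per-label weights expanded to one weight per point
label_weights <- list(A = 2, B = 1)
per_point <- unlist(label_weights[labels], use.names = FALSE)  # 2 2 2 1

# Balancing sketch (assumed rule): weight each point by 1 / (count of its
# label), so that the total weight of every label comes out the same
counts <- table(labels)
balanced <- as.numeric(1 / counts[labels])
total_per_label <- tapply(balanced, labels, sum)
```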

Data Splitting

We can split the data into training and testing sets with split_data, which returns a named list with two entries, train and test. Each entry is also a named list, with contents depending on the problem type:

  • classification and regression: X and y
  • survival: X, deaths, and times
  • prescription: X, treatments, and outcomes
  • policy: X and rewards
  • imputation: X

For example, the following code splits the data for a classification problem into training and testing sets:

X <- df[, 1:4]
y <- df$Species
iai::set_julia_seed(1)
split <- iai::split_data("classification", X, y)
train_X <- split$train$X
train_y <- split$train$y
test_X <- split$test$X
test_y <- split$test$y
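The nested shape of the returned list can be mimicked with plain R lists, which may help when writing code that consumes it. The sketch below builds a stand-in by hand with a simple random partition; it does not call iai::split_data and is not how IAI chooses the split. The data, the 7-of-10 training size, and the name mock_split are all illustrative.

```r
set.seed(1)

# Illustrative features and labels
X <- data.frame(a = rnorm(10), b = rnorm(10))
y <- rep(c("A", "B"), 5)

# Hand-made stand-in with the same nesting as a classification split
train_idx <- sample(nrow(X), 7)
mock_split <- list(
  train = list(X = X[train_idx, ], y = y[train_idx]),
  test = list(X = X[-train_idx, ], y = y[-train_idx])
)

# Sanity checks: every row lands in exactly one of the two sets
n_train <- nrow(mock_split$train$X)
n_test <- nrow(mock_split$test$X)
all_rows_used <- (n_train + n_test) == nrow(X)
```

The same checks can be applied to the real result, for example comparing nrow(train_X) + nrow(test_X) against nrow(X).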