Data Preparation Guide
Conceptually, the R interface to IAI accepts data in the same formats as described for Julia, except as R data structures. Below we discuss how to supply the various data elements from R.
Features
The feature matrix X can be supplied in two ways:
- a matrix containing all numeric entries, or
- a data.frame
We recommend supplying the features as a data.frame since it supports more kinds of data, such as categorical data and missing values.
You can read in CSV files as a data.frame with:
df <- read.table("iris.csv", sep = ",", header = TRUE)
SepalLength SepalWidth PetalLength PetalWidth Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
[ reached 'max' / getOption("max.print") -- omitted 138 rows ]
Numerical features
A column containing only numeric values will be treated as a numeric feature by default.
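As a quick check before fitting, it can help to confirm which columns R actually stores as numeric, since a single stray text entry in a CSV file causes the whole column to be read as text. The sketch below uses only base R; the `df_demo` data and its "n/a" entry are hypothetical and are not part of the iris file above.

```r
# Small illustrative data.frame (hypothetical values)
df_demo <- data.frame(
  SepalLength = c(5.1, 4.9, 4.7),
  SepalWidth = c("3.5", "3.0", "n/a"),  # stored as text because of "n/a"
  stringsAsFactors = FALSE
)

# Report which columns R stores as numeric
is_num <- sapply(df_demo, is.numeric)
print(is_num)

# Convert the text column to numeric; entries that cannot be parsed become NA
df_demo$SepalWidth <- suppressWarnings(as.numeric(df_demo$SepalWidth))
```

After the conversion, the unparseable entry is a missing value, which a data.frame represents as NA.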
Categorical features
Categorical features are represented using a factor. For example, it is possible to mark existing columns in the dataframe as categorical:
df$Species <- as.factor(df$Species)
Refer to the R documentation on factors for more information on working with categorical features.
Ordinal features
Ordinal features are represented using an ordered factor. For example, you can mark an existing column as ordered with:
df$Species <- factor(df$Species, ordered = TRUE,
                     levels = c("virginica", "versicolor", "setosa"))
Again, refer to the R documentation on factors for more information on working with ordinal features.
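To see what the ordering means in practice, the following base R sketch builds an ordered factor on a small hypothetical vector (not the full dataset) and compares its elements against a level. The order of the `levels` argument determines which values count as smaller.

```r
# Hypothetical vector of labels
species <- c("setosa", "virginica", "versicolor", "setosa")

# Levels are listed from lowest to highest
species_ord <- factor(species, ordered = TRUE,
                      levels = c("virginica", "versicolor", "setosa"))

# Confirm the factor is ordered and inspect the level order
print(is.ordered(species_ord))
print(levels(species_ord))

# Ordered factors support comparisons against a level
below_setosa <- species_ord < "setosa"
print(below_setosa)
```

An unordered factor would not support the `<` comparison, which is one way to tell the two representations apart.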
Mixed features
You can mark a feature as an IAI mixed data type using as.mixeddata. For example, we can generate a mixed numeric and categorical column and mark it as mixed:
set.seed(1)
df$mixed <- rnorm(150)
df$mixed[1:5] <- NA # Insert some missing values
df$mixed[6:10] <- "Not graded"
df$mixed <- iai::as.mixeddata(df$mixed, c("Not graded"))
SepalLength SepalWidth PetalLength PetalWidth Species mixed
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3.0 1.4 0.2 setosa NA
3 4.7 3.2 1.3 0.2 setosa NA
4 4.6 3.1 1.5 0.2 setosa NA
5 5.0 3.6 1.4 0.2 setosa NA
6 5.4 3.9 1.7 0.4 setosa Not graded
7 4.6 3.4 1.4 0.3 setosa Not graded
8 5.0 3.4 1.5 0.2 setosa Not graded
9 4.4 2.9 1.4 0.2 setosa Not graded
10 4.9 3.1 1.5 0.1 setosa Not graded
[ reached 'max' / getOption("max.print") -- omitted 140 rows ]
Missingness
Missing values are represented in a data.frame by the special value NA.
For example, the following code rewrites all values of "setosa" to be missing:
df$Species[df$Species == "setosa"] <- NA
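Before passing data on, it is often useful to count how many values are missing in each column. The sketch below does this with base R on a small hypothetical data.frame called `df_demo`. It also shows that assigning NA to a factor leaves the original level in place until `droplevels` is called. Whether unused levels matter downstream is not covered here.

```r
# Hypothetical data with one missing numeric value
df_demo <- data.frame(
  PetalLength = c(1.4, NA, 1.3, 1.5),
  Species = factor(c("setosa", "versicolor", "virginica", "setosa"))
)

# Rewrite all values of "setosa" to be missing
df_demo$Species[df_demo$Species == "setosa"] <- NA

# Count missing values per column and flag rows with no missing values
missing_per_column <- colSums(is.na(df_demo))
complete_rows <- complete.cases(df_demo)
print(missing_per_column)

# The "setosa" level is still recorded even though no row uses it
levels_before <- levels(df_demo$Species)

# Optionally remove levels that no longer appear in the data
df_demo$Species <- droplevels(df_demo$Species)
```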
Target
The format of the target data depends on the problem type as described for Julia. For each target vector, you should pass an atomic vector with elements of the appropriate type.
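As a rough illustration, the base R sketch below builds target vectors for a few problem types and checks that each is atomic. The element types chosen here (a factor for classification labels, numeric for regression, logical `deaths` and numeric `times` for survival) are assumptions based on common usage. The Julia documentation remains the authority on what each problem type expects, and all values are hypothetical.

```r
# Hypothetical targets for three observations
y_class <- factor(c("setosa", "versicolor", "setosa"))  # classification labels
y_reg <- c(5.1, 4.9, 4.7)                               # regression outcomes
deaths <- c(TRUE, FALSE, TRUE)                          # survival: event observed?
times <- c(12.5, 30.0, 7.2)                             # survival: observation times

targets <- list(y_class = y_class, y_reg = y_reg,
                deaths = deaths, times = times)

# Every target should be an atomic vector
all_atomic <- sapply(targets, is.atomic)
print(all_atomic)

# Every target should have one element per row of the feature matrix
target_lengths <- sapply(targets, length)
print(target_lengths)
```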
Sample Weights
Many functions in the R interface accept a sample_weight keyword argument in the same way as the Julia modules. There are multiple ways you can use this argument to specify the weights:
- NULL (default) will assign equal weight to all points
- supply an atomic vector containing the weights to use for each point
For classification and prescription problems, you can also use:
- a list indicating the weight to assign to each label, e.g. to assign twice the weight to points of type "A" over type "B", we can use list(A = 2, B = 1)
- "autobalance" to choose weights that balance the label distribution, so that the total weight for each label is the same
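The sketch below constructs each form of weight specification in base R for a small hypothetical label vector. The last part computes per-point weights by hand so that every label has the same total weight, which is the property described for "autobalance". It is only an illustration of that property, and the scaling IAI uses internally may differ.

```r
# Hypothetical labels: three points of type "A", one of type "B"
y <- factor(c("A", "A", "A", "B"))

# Option 1: one weight per point, as an atomic vector
w_points <- c(1, 1, 2, 0.5)

# Option 2: one weight per label, as a named list
w_labels <- list(A = 2, B = 1)

# Manual balancing: weight each point by the inverse of its label count
label_counts <- table(y)
w_balanced <- as.numeric(1 / label_counts[as.character(y)])

# Total weight per label is now the same for every label
label_totals <- tapply(w_balanced, y, sum)
print(label_totals)
```

Any of `w_points`, `w_labels`, or the string "autobalance" would then be supplied as the sample_weight argument of a function that accepts it.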
Data Splitting
We can split the data into training and testing sets with split_data, which returns a named list with two entries, train and test. Each entry is also a named list, with contents depending on the problem type:
- classification and regression: X and y
- survival: X, deaths, and times
- prescription: X, treatments, and outcomes
- policy: X and rewards
- imputation: X
For example, the following code splits classification data into training and testing sets:
X <- df[, 1:4]
y <- df$Species
iai::set_julia_seed(1)
split <- iai::split_data("classification", X, y)
train_X <- split$train$X
train_y <- split$train$y
test_X <- split$test$X
test_y <- split$test$y