Data Preparation Guide

Conceptually, the Python interface to IAI accepts data in the same formats as described for Julia, but expressed as Python data structures. Below we discuss how to supply each data element from Python.

Features

The feature matrix X can be supplied in two ways:

  • a 2-D numpy.array
  • a pandas.DataFrame

We recommend supplying the features as a pandas.DataFrame because it supports more functionality, such as categorical data and missing values.

You can read in CSV files as a pandas.DataFrame with:

import pandas as pd
df = pd.read_csv("iris.csv")
     SepalLength  SepalWidth  PetalLength  PetalWidth    Species
0            5.1         3.5          1.4         0.2     setosa
1            4.9         3.0          1.4         0.2     setosa
2            4.7         3.2          1.3         0.2     setosa
3            4.6         3.1          1.5         0.2     setosa
4            5.0         3.6          1.4         0.2     setosa
5            5.4         3.9          1.7         0.4     setosa
6            4.6         3.4          1.4         0.3     setosa
..           ...         ...          ...         ...        ...
143          6.8         3.2          5.9         2.3  virginica
144          6.7         3.3          5.7         2.5  virginica
145          6.7         3.0          5.2         2.3  virginica
146          6.3         2.5          5.0         1.9  virginica
147          6.5         3.0          5.2         2.0  virginica
148          6.2         3.4          5.4         2.3  virginica
149          5.9         3.0          5.1         1.8  virginica

[150 rows x 5 columns]

Numerical features

A column that contains only numeric values is treated as numeric by default.
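Before training, it can help to confirm which columns pandas actually considers numeric, since a column read as text will not be. The following is a minimal sketch using standard pandas calls; the dataframe `demo` and its values are made up for illustration and are not part of the iris data above.

```python
import pandas as pd

# Small illustrative dataframe; "PetalWidth" was read as text by mistake
demo = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 4.7],
    "PetalWidth": ["0.2", "0.2", "n/a"],
})

# Report which columns pandas treats as numeric
numeric_before = {
    col: pd.api.types.is_numeric_dtype(demo[col]) for col in demo.columns
}

# Convert the text column; entries that cannot be parsed become NaN
demo["PetalWidth"] = pd.to_numeric(demo["PetalWidth"], errors="coerce")

numeric_after = {
    col: pd.api.types.is_numeric_dtype(demo[col]) for col in demo.columns
}
```

Here `errors="coerce"` turns unparseable entries into missing values rather than raising an error, which may or may not be what you want for your data.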

Categorical features

Categorical features are represented using a pandas.Categorical. For example, it is possible to mark existing columns in the dataframe as categorical:

df.Species = df.Species.astype('category')

Refer to the pandas documentation on categorical data for more information on working with categorical features.
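To see what the conversion produces, the sketch below inspects a categorical column through the pandas `.cat` accessor. The series `species` is a small hypothetical stand-in for the dataframe column.

```python
import pandas as pd

# Hypothetical stand-in for a label column
species = pd.Series(["setosa", "virginica", "setosa", "versicolor"])
species = species.astype("category")

# The distinct levels, sorted by pandas when inferred from the data
levels = species.cat.categories.tolist()

# Integer position of each entry's level within `levels`
codes = species.cat.codes.tolist()

# A plain categorical carries no ordering between its levels
is_ordered = species.cat.ordered
```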

Ordinal features

Ordinal features are also represented using a pandas.Categorical, except the levels in the vector are marked as ordered. For example, you can mark an existing column as ordered with:

from pandas.api.types import CategoricalDtype
df.Species = df.Species.astype(
    CategoricalDtype(['virginica', 'versicolor', 'setosa'], ordered=True))

Again, refer to the pandas documentation on categorical data for more information on working with ordinal features.
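Once the levels are ordered, pandas supports comparisons and order-aware sorting on the column. The sketch below shows this on a hypothetical `sizes` series; the labels and their order are chosen purely for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical ordinal data with an explicit low-to-high ordering
sizes = pd.Series(["medium", "small", "large", "small"])
order = CategoricalDtype(["small", "medium", "large"], ordered=True)
sizes = sizes.astype(order)

# Ordered categoricals support min/max and comparisons against a level
smallest = sizes.min()
above_small = (sizes > "small").tolist()

# Sorting follows the declared level order, not alphabetical order
sorted_sizes = sizes.sort_values().tolist()
```

Note that any label not listed in the CategoricalDtype is converted to a missing value by astype, so it is worth checking that the list of levels is complete.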

Mixed features

The IAI mixed data type is represented using the MixedData class, which implements the pandas.api.extensions.ExtensionArray interface.

For example, we can generate a column that mixes numeric and categorical values and mark it as mixed with MixedData:

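import numpy as np
from interpretableai import iai

np.random.seed(0)
# Random integer grades in {1, 2, 3}, stored as objects so strings can be mixed in
mixed = np.random.randint(1, 4, df.shape[0]).astype('object')
mixed[0:5] = np.nan  # Insert some missing values
mixed[5:10] = "Not graded"  # Insert a categorical level
mixed = iai.MixedData(mixed)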
[NaN, NaN, NaN, NaN, NaN, ..., 2, 1, 1, 1, 1]
Length: 150
Categories (4, object): [1, 2, 3, 'Not graded']
Ordinal levels: None

We can then add it to the dataframe:

df = df.assign(mixed=mixed)

Missingness

Missing values are represented in a pandas.DataFrame by the special value np.nan.

For example, the following code marks the first 30 values of "Species" as missing:

df.loc[:29, 'Species'] = np.nan  # .loc slices include the end label, so this covers rows 0 to 29
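It is often useful to check how much data is missing before fitting a model. The sketch below counts missing values with standard pandas calls; the dataframe `demo` is a small hypothetical example rather than the iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing numeric and one missing categorical entry
demo = pd.DataFrame({
    "SepalLength": [5.1, np.nan, 4.7, 4.6],
    "Species": pd.Categorical(["setosa", "setosa", None, "virginica"]),
})

# Number of missing values in each column
missing_per_column = demo.isna().sum().to_dict()

# Number of rows with at least one missing value
rows_with_missing = int(demo.isna().any(axis=1).sum())
```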

Target

The format of the target data depends on the problem type as described for Julia. For each target vector, you should pass either a 1-D numpy.array or a pandas.Series with elements of the appropriate type.
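A common slip is passing a single-column dataframe, which is 2-D, where a 1-D target is expected. The sketch below checks dimensionality with NumPy; the helper `is_one_dimensional` and the example values are hypothetical and not part of the IAI interface.

```python
import numpy as np
import pandas as pd

def is_one_dimensional(target):
    """Return True if `target` converts to a 1-D array (hypothetical helper)."""
    return np.asarray(target).ndim == 1

# Hypothetical targets: a Series of labels and an array of numeric values
y_labels = pd.Series(["setosa", "virginica", "setosa"])
y_values = np.array([1.5, 2.0, 3.25])

# A single-column dataframe is 2-D; selecting the column gives a 1-D Series
frame = pd.DataFrame({"y": [1.5, 2.0, 3.25]})
y_from_frame = frame["y"]
```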

Sample Weights

The functions in the Python interface accept a sample_weight keyword argument in the same way as the Julia modules. There are multiple ways you can use this argument to specify the weights:

  • None (default) will assign equal weight to all points
  • supply a 1-D numpy.array or a pandas.Series containing the weights to use for each point

For classification and prescription problems, you can also use:

  • a dict indicating the weight to assign to each label, e.g. to assign twice the weight to points of type "A" over type "B", we can use

      {'A': 2, 'B': 1}
  • 'autobalance' to choose weights that balance the label distribution, so that the total weight for each label is the same
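To make the last two options concrete, the sketch below computes per-point weights by hand in pandas. It only illustrates the ideas described above (a per-label lookup, and weights under which every label has the same total); it is an assumption-laden illustration and not IAI's internal implementation, whose scaling may differ. The series `y` is hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced labels: three points of type "A", one of type "B"
y = pd.Series(["A", "A", "A", "B"])

# Dict option: look up each point's weight from its label
label_weights = {"A": 2, "B": 1}
per_point = y.map(label_weights).tolist()

# Balanced option: weight each point by 1 / (count of its label),
# so the weights of every label sum to the same total (1.0 here)
counts = y.value_counts()
balanced = 1.0 / y.map(counts)
totals = balanced.groupby(y).sum().to_dict()
```

Either result is a per-point vector of weights, which is the same form as the 1-D numpy.array or pandas.Series option listed above.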

Data Splitting

We can split the data into training and testing sets with split_data:

X = df.iloc[:, 0:4]
y = df.Species
(train_X, train_y), (test_X, test_y) = iai.split_data('classification', X, y,
                                                      seed=1)