Data Preparation Guide
Conceptually, the Python interface to IAI accepts data in the same formats as described for Julia, except as Python data structures. Below we discuss how to supply the various data elements from Python.
Features
The feature matrix X can be supplied in two ways:
- a 2-D numpy.array containing all numeric entries, or
- a pandas.DataFrame
We recommend supplying the features as a pandas.DataFrame, since it supports additional capabilities such as categorical data and missing values.
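As a minimal sketch of the two input forms, the snippet below builds the same small feature matrix both ways. The variable names `X_array` and `X_frame` and the three sample rows are illustrative only.

```python
import numpy as np
import pandas as pd

# Option 1: a 2-D numpy array with purely numeric entries
X_array = np.array([[5.1, 3.5, 1.4, 0.2],
                    [4.9, 3.0, 1.4, 0.2],
                    [6.3, 2.5, 5.0, 1.9]])

# Option 2: a DataFrame, which also carries column names and can hold
# categorical columns and missing values alongside the numeric ones
X_frame = pd.DataFrame(
    X_array,
    columns=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"],
)
X_frame["Species"] = pd.Categorical(["setosa", "setosa", "virginica"])
X_frame.loc[1, "SepalWidth"] = np.nan  # one missing entry

print(X_array.shape)
print(X_frame.dtypes)
```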
You can read in CSV files as a pandas.DataFrame with:
import pandas as pd
df = pd.read_csv("iris.csv")
SepalLength SepalWidth PetalLength PetalWidth Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
.. ... ... ... ... ...
143 6.8 3.2 5.9 2.3 virginica
144 6.7 3.3 5.7 2.5 virginica
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
Numerical features
A column containing only numeric values is treated as numeric by default.
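One way to check which columns pandas itself considers numeric is to inspect their dtypes. The sketch below uses a small hypothetical frame, `df_check`, in which one column holds digits stored as strings; such a column does not have a numeric dtype until it is converted.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical data: "Grade" holds digits stored as strings
df_check = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 4.7],
    "PetalCount": [5, 5, 6],
    "Grade": ["1", "2", "3"],
})

# Columns whose dtype is numeric according to pandas
numeric_cols = [c for c in df_check.columns if is_numeric_dtype(df_check[c])]
print(numeric_cols)

# Convert the string column so that it has a numeric dtype as well
df_check["Grade"] = pd.to_numeric(df_check["Grade"])
print(df_check.dtypes)
```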
Categorical features
Categorical features are represented using a pandas.Categorical. For example, it is possible to mark existing columns in the dataframe as categorical:
df.Species = df.Species.astype('category')
Refer to the pandas documentation on categorical data for more information on working with categorical features.
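To see what the conversion produces, the following sketch converts a short illustrative series (not the full iris data) and inspects the resulting categories, which pandas sorts and leaves unordered by default.

```python
import pandas as pd

# A short stand-in for the Species column
species = pd.Series(["setosa", "virginica", "setosa", "versicolor"])
species_cat = species.astype("category")

# Categories are inferred from the data and sorted
print(species_cat.cat.categories.tolist())  # ['setosa', 'versicolor', 'virginica']

# A plain categorical carries no ordering between its levels
print(species_cat.cat.ordered)  # False

# Each value is stored as an integer code into the category list
print(species_cat.cat.codes.tolist())  # [0, 2, 0, 1]
```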
Ordinal features
Ordinal features are also represented using a pandas.Categorical, except the levels in the vector are marked as ordered. For example, you can mark an existing column as ordered with:
from pandas.api.types import CategoricalDtype
df.Species = df.Species.astype(
    CategoricalDtype(['virginica', 'versicolor', 'setosa'], ordered=True))
Again, refer to the pandas documentation on categorical data for more information on working with ordinal features.
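The effect of marking levels as ordered can be seen in a small sketch. The `sizes` column and its levels are hypothetical, chosen because they have a natural order; once the dtype is ordered, pandas supports comparisons and min/max on the column.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical ordinal column: the order of the list defines the order of the levels
sizes = pd.Series(["small", "large", "medium", "small"])
size_type = CategoricalDtype(["small", "medium", "large"], ordered=True)
sizes = sizes.astype(size_type)

print(sizes.cat.ordered)             # True
print(sizes.min(), sizes.max())      # small large
print((sizes >= "medium").tolist())  # [False, True, True, False]
```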
Mixed features
The IAI mixed data type is represented using the MixedData class, which implements the pandas.api.extensions.ExtensionArray interface.
For example, we can generate a mixed numeric and categorical column and mark it as mixed with MixedData:
import numpy as np
from interpretableai import iai
np.random.seed(0)
mixed = np.random.randint(1, 4, df.shape[0]).astype('object')
mixed[0:5] = np.nan # Insert some missing values
mixed[5:10] = "Not graded"
mixed = iai.MixedData(mixed)
[NaN, NaN, NaN, NaN, NaN, ..., 2, 1, 1, 1, 1]
Length: 150
Categories (4, object): [1, 2, 3, 'Not graded']
Ordinal levels: None
We can then add it to the dataframe:
df = df.assign(mixed=mixed)
Missingness
Missing values are represented in a pandas.DataFrame by the special value np.nan.
For example, the following code marks the first 30 values of "Species" as missing:
df.loc[:29, 'Species'] = np.nan # .loc slices include the end label, so 0 to 29 is 30 rows
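Label-based slicing with .loc includes the end label, whereas position-based slicing with .iloc excludes the end position, so it can help to count the missing entries after marking them. The sketch below uses hypothetical 50-row frames with the default integer index.

```python
import numpy as np
import pandas as pd

# Label-based: rows labelled 0 through 29 inclusive, i.e. 30 rows
by_label = pd.DataFrame({"Species": ["setosa"] * 50})
by_label.loc[:29, "Species"] = np.nan

# Position-based: the first 30 positions, end position excluded
by_position = pd.DataFrame({"Species": ["setosa"] * 50})
col = by_position.columns.get_loc("Species")
by_position.iloc[:30, col] = np.nan

print(by_label["Species"].isna().sum())     # 30
print(by_position["Species"].isna().sum())  # 30

# For comparison, a .loc slice ending at label 30 selects 31 rows
print(len(by_label.loc[:30]))               # 31
```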
Target
The format of the target data depends on the problem type as described for Julia. For each target vector, you should pass either a 1-D numpy.array or a pandas.Series with elements of the appropriate type.
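As a rough illustration, the sketch below extracts targets from a small hypothetical frame `df_t` in both accepted forms, and shows how a single-column 2-D array can be flattened to the required 1-D shape. Which column serves as the target is an assumption made for the example.

```python
import pandas as pd

df_t = pd.DataFrame({
    "PetalWidth": [0.2, 0.2, 2.3],
    "Species": ["setosa", "setosa", "virginica"],
})

# A pandas.Series of labels, e.g. for a classification target
y_class = df_t["Species"]

# A 1-D numpy array of numbers, e.g. for a regression target
y_reg = df_t["PetalWidth"].to_numpy()

# Selecting with a list of columns yields a 2-D array of shape (3, 1);
# ravel() flattens it to the 1-D shape (3,)
y_column = df_t[["PetalWidth"]].to_numpy()
y_flat = y_column.ravel()

print(y_reg.shape, y_column.shape, y_flat.shape)
```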
Sample Weights
The functions in the Python interface accept a sample_weight keyword argument in the same way as the Julia modules. There are multiple ways you can use this argument to specify the weights:
- None (default) will assign equal weight to all points
- supply a 1-D numpy.array or a pandas.Series containing the weights to use for each point
For classification and prescription problems, you can also use:
- a dict indicating the weight to assign to each label, e.g. to assign twice the weight to points of type "A" over type "B", we can use {'A': 2, 'B': 1}
- 'autobalance' to choose weights that balance the label distribution, so that the total weight for each label is the same
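To make the last two options concrete, the sketch below works out in plain pandas what per-point weights a label dict corresponds to, and one way to compute weights whose totals are equal across labels. This is an illustration of the idea only, not the library's implementation; in particular, the scaling that makes the weights sum to the number of points is an assumption.

```python
import pandas as pd

labels = pd.Series(["A", "A", "A", "B"])

# A per-label dict expands to one weight per point
label_weights = {"A": 2, "B": 1}
point_weights = labels.map(label_weights)
print(point_weights.tolist())  # [2, 2, 2, 1]

# Balanced weights: each point gets n / (k * count of its label), so every
# label has the same total weight (scaling chosen for this sketch only)
counts = labels.value_counts()
n, k = len(labels), len(counts)
balanced = labels.map(lambda label: n / (k * counts[label]))

# Total weight per label is now equal
totals = balanced.groupby(labels).sum()
print(totals)
```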
Data Splitting
We can split the data into training and testing with split_data:
X = df.iloc[:, 0:4]
y = df.Species
(train_X, train_y), (test_X, test_y) = iai.split_data('classification', X, y,
                                                      seed=1)
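For readers who want to see what a seeded train/test split produces without running IAI, here is a stand-in written in plain pandas. It is not how split_data works internally; the 70/30 fraction, the per-label sampling, and the `data` frame are all assumptions of this sketch.

```python
import pandas as pd

# Hypothetical data: 10 points of each of two labels
data = pd.DataFrame({"x": range(20), "label": ["a"] * 10 + ["b"] * 10})

# Sample 70% of each label for training, with a fixed seed for reproducibility
train = data.groupby("label", group_keys=False).sample(frac=0.7, random_state=1)

# Everything not sampled forms the test set
test = data.drop(train.index)

train_X, train_y = train[["x"]], train["label"]
test_X, test_y = test[["x"]], test["label"]

print(len(train_X), len(test_X))
print(train_y.value_counts().to_dict())
```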