Optimal Tree Learners
OptimalTrees provides many learners for training optimal tree models, which we describe on this page along with a guide to their parameters.
Shared Parameters
All of the learners provided by OptimalTrees are OptimalTreeLearners. In addition to the shared learner parameters, these learners support the following parameters to control the behavior of the OptimalTrees algorithm.
Commonly-used shared parameters
max_depth
max_depth accepts a non-negative Integer to control the maximum depth of the fitted tree. This parameter must always be explicitly set or tuned. We recommend tuning this parameter using the grid search process described in the guide to parameter tuning.
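For instance, a minimal sketch of tuning max_depth over a small grid with the IAI interface might look as follows, where X and y stand in for your training data:

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(random_seed=1),  # seed fixed for reproducibility
    max_depth=1:5,                             # candidate depths to evaluate
)
IAI.fit!(grid, X, y)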
minbucket
minbucket controls the minimum number of points that must be present in every leaf node of the fitted tree. There are two ways to specify the minbucket:
- a positive Integer specifying the minimum number of points for each leaf
- a Real between 0 and 1 indicating the minimum proportion of the points that must be present in each leaf (e.g. 0.01 means each leaf must have at least 1% of the points)
The default value is 1, effectively applying no restriction. If you notice that some leaves in the tree have relatively few points, you might like to increase minbucket to prevent these leaves from occurring.
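As an illustrative sketch, either form can be passed directly when constructing a learner:

IAI.OptimalTreeClassifier(minbucket=20)    # at least 20 points in each leaf
IAI.OptimalTreeClassifier(minbucket=0.05)  # at least 5% of the points in each leaf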
cp
cp is a Real known as the complexity parameter, which determines the tradeoff between the accuracy and complexity of the tree in order to control overfitting. When training the model, the algorithm optimizes the following objective function:
\[\min\ \text{error} + \texttt{cp} \times \text{complexity}\]
A higher value of cp increases the penalty on each split in the tree, leading to shallower trees. This parameter must always be explicitly set or tuned, and should almost always be chosen using the autotuning procedure described in the guide to parameter tuning.
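If you do need to fix cp manually rather than autotune it, it can be set like any other parameter (the value used here is purely illustrative):

IAI.OptimalTreeClassifier(max_depth=3, cp=0.001)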
missingdatamode
missingdatamode specifies the method to use for handling missing values when constructing the splits in the tree. The following options are supported:
- :none: The default option; an error will be thrown if missing data is encountered
- :separate_class: The separate class algorithm treats the missing values as a separate class when creating a split, so in addition to deciding how to split the non-missing values as normal, the tree will decide to send all missing values to either the lower or upper child of the split
- :always_left: Always send missing values to the lower child of the split
- :always_right: Always send missing values to the upper child of the split
More details on each method are available in the paper by Ding and Simonoff (2010).
You can also specify values for specific features using a FeatureMapping. For instance, we can specify :separate_class for one feature and :always_left for the rest with:
Dict(:feature => :separate_class, Not(:feature) => :always_left)
When using a FeatureMapping, any feature without a specified value will default to :none.
Missing values are not directly supported when using hyperplane splits or linear regression in the leaves. You can refer to the guide on missing data for Optimal Feature Selection to see examples of how to deal with the presence of missing data in such problems.
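Putting this together, a hedged sketch of a learner that applies a per-feature mapping (the feature name :income is hypothetical):

lnr = IAI.OptimalTreeClassifier(
    # separate-class handling for :income, send all other missings left
    missingdatamode=Dict(:income => :separate_class, Not(:income) => :always_left),
)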
ls_num_tree_restarts
ls_num_tree_restarts is an Integer specifying the number of random restarts to use in the local search algorithm. Must be positive and defaults to 100. The performance of the tree typically increases as this value is increased, but with quickly diminishing returns, while the computational cost of training increases linearly with this value. You might like to try increasing this value if you are seeing instability in your results, but our experience is that there is not much gain to be had increasing beyond 1000.
split_features
split_features specifies the set of features in the data that are allowed to be used by the splits in the tree. Defaults to All(), which allows all features to be used in the splits. See FeatureSet for the different ways to specify this value.
This parameter only needs to be specified if you want to restrict the splits in the tree to a subset of the features.
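For example, the following sketch restricts splits to two hypothetical features:

lnr = IAI.OptimalTreeClassifier(split_features=[:age, :bmi])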
refit_learner
refit_learner can be used to automatically refit a new model in each of the leaves of the trained tree. As an example, you can use GLMNetCVClassifier or OptimalFeatureSelectionClassifier to refit logistic regression models in each leaf of a classification tree, which can improve the model performance. For an example of this, please see the guide to classification trees with logistic regression.
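As a sketch of how this might look, the following constructs a classification tree whose leaves are refit with cross-validated glmnet logistic regressions:

lnr = IAI.OptimalTreeClassifier(
    max_depth=3,
    refit_learner=IAI.GLMNetCVClassifier(),  # refit each leaf after training
)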
Hyperplane-related parameters
hyperplane_config
hyperplane_config controls the behavior of hyperplane splits in the tree fitting process. To simply enable standard hyperplane splits in the tree, you should pass:
hyperplane_config=(sparsity=:all,)
For more advanced control of the hyperplane splits, there are additional options you can specify. You must pass a NamedTuple or a vector of NamedTuples. Each NamedTuple can contain one or more of the following keys:
- sparsity controls the maximum number of features used in each hyperplane split. See RelativeParameterInput for the different ways to specify this value.
- feature_set specifies the set of potential features used in each hyperplane split. Defaults to All(), allowing all features to be considered for hyperplane splits. You can also specify a set of features to restrict which features are allowed to be used in the hyperplane splits. See FeatureSet for the different ways to specify this value.
- values specifies the values that hyperplane weights can take:
  - :continuous is the default option, allowing any continuous values to be used for the weights
  - :discrete restricts the weights to be integer-valued
  - You can also pass a set of real values to restrict all hyperplane weights to be chosen from this set of possible values
- penalty is an increasing vector of Reals that specifies the complexity penalty incurred by any given number of features included in the hyperplane split. The default penalty is 1:p, where p is the number of features specified in feature_set. This means that adding an additional feature to a hyperplane split carries the same complexity cost as adding an additional parallel split. Values that increase more slowly than this, such as sqrt.(1:p), encourage denser hyperplane splits over parallel splits, whereas values that increase more quickly, such as (1:p).^2, penalize dense hyperplane splits so that they are used only when extremely powerful.
A different hyperplane split search is conducted for each NamedTuple that is passed. For instance, the following parameter setting specifies that we want to consider hyperplane splits on the first three features with any continuous weights, and also hyperplane splits on features 6-9 with integer-valued weights:
hyperplane_config=[
(sparsity=:all, feature_set=[1, 2, 3]),
(sparsity=:all, feature_set=6:9, values=:discrete),
]
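The configuration is passed like any other parameter when constructing the learner, for example (a minimal sketch):

lnr = IAI.OptimalTreeClassifier(
    max_depth=4,
    hyperplane_config=(sparsity=:all,),  # enable standard hyperplane splits
)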
ls_num_hyper_restarts
ls_num_hyper_restarts is an Integer controlling the number of random restarts to use when optimizing hyperplane splits. Must be positive and defaults to 5. If you are noticing that your hyperplane splits are taking a long time to train, you might like to try decreasing this value towards 1 to speed up the training without a significant effect on the final performance.
Rarely-used shared parameters
ls_ignore_errors
ls_ignore_errors is a Bool that controls whether to ignore any errors that arise in the local search procedure. The only reason to enable this is to ignore errors resulting from an edge-case bug before the bug is fixed. Defaults to false.
localsearch
localsearch is a Bool that determines whether to use the local search procedure to train the tree. Defaults to true; when set to false, a greedy algorithm similar to CART will be used to train the tree.
ls_num_categoric_restarts
ls_num_categoric_restarts is an Integer controlling the number of random restarts to use when optimizing categoric splits. Must be positive and defaults to 10. There is no need to change this parameter.
ls_warmstart_criterion
ls_warmstart_criterion is a Symbol that specifies which criterion to use when generating the random restarts for the local search (see Scoring Criteria). The default value depends on the problem type and does not need to be changed.
ls_scan_reverse_split
ls_scan_reverse_split is a Bool that specifies whether to enable reverse-scan functionality in the local search to conduct a more intensive search for optimal splits. This may lead to minor improvements in tree quality, but typically increases the training time by roughly 15-20%. The default value is false, as this tradeoff is generally not worthwhile.
max_categoric_levels_before_warning
max_categoric_levels_before_warning is an Integer that controls the maximum number of categoric levels that are allowed before a warning is shown. We recommend extreme caution when using categoric features with many levels inside Optimal Trees, and recommend reading our advice for dealing with such features. To suppress the warning, you can increase the value of this parameter.
weighted_minbucket
weighted_minbucket controls the minimum total sample weight that must be present among the points in every leaf node of the fitted tree. This value can be specified in the same way as minbucket. The default value is 1, which in most cases results in no extra restrictions.
cp_tuning_se_tolerance
cp_tuning_se_tolerance controls the tolerance used while tuning the value of cp during a grid search. The tuning procedure will select the largest value of cp that achieves a validation score no more than this number of standard errors from the best validation score. The default value is 0, which means that the cp leading to the best validation score is selected. Setting this value to 1 will result in the so-called one-standard-error rule being used to select the best value for cp.
checkpoint_interval
checkpoint_interval controls how often checkpoint files are created if training checkpoints are enabled during training. Defaults to 10, which means that a checkpoint will be created after every 10th tree is trained. Set this to a smaller number for more frequent checkpoints, or a larger number to reduce the frequency of checkpointing.
Classification Learners
The OptimalTreeClassifier is used for training Optimal Classification Trees. The following values for criterion are permitted:
- :misclassification (default)
- :gini
- :entropy
regression_features
regression_features specifies whether to fit logistic regression predictions in the leaves of the tree using linear discriminant analysis (LDA), and if so, the set of features that can be used in these regressions. The default value is [], indicating that only a constant term will be used for prediction. If set to All(), there is no restriction on which features can be included. If you would like to restrict the LDA model to a subset of the features in the data, you can specify the set of feature indices that are permitted in the regression. See FeatureSet for the different ways to specify this value.
Note that the LDA models will always use all of the features specified in regression_features, resulting in a fully-dense fit. For this reason, it may be useful to restrict the number of features used, to refit the tree using alternative logistic regression approaches after fitting, or both.
When training the LDA models in each leaf, we assume independence among the features being used. This is another factor to consider when selecting which features to include, as we may wish to avoid selecting features that are highly correlated.
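As an illustrative sketch, the following restricts the LDA models in the leaves to two hypothetical features:

lnr = IAI.OptimalTreeClassifier(
    max_depth=3,
    regression_features=[:age, :bmi],  # LDA models may only use these features
)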
Regression Learners
The OptimalTreeRegressor is used for training Optimal Regression Trees. The following values for criterion are permitted:
In addition to the shared parameters, these learners also support the shared regression parameters as well as the following parameters.
regression_features
regression_features specifies whether to use linear regression predictions in the leaves of the tree, and if so, the set of features that can be used in linear regressions. The default value is [], indicating that only a constant term will be used for prediction. If set to All(), there is no restriction on which features can be included. If you would like to restrict the linear regression to a subset of the features in the data, you can specify the set of feature indices that are permitted in the regression. See FeatureSet for the different ways to specify this value.
regression_lambda
regression_lambda is a non-negative Real that controls the amount of regularization used when fitting the regression equations in each leaf. The default value is 0.01. The parameter tuning guide provides recommendations on how to tune the value of this parameter.
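As a sketch combining these parameters, the following fits regularized linear regressions in each leaf with all features available:

lnr = IAI.OptimalTreeRegressor(
    max_depth=3,
    regression_features=All(),  # allow any feature in the leaf regressions
    regression_lambda=0.01,     # regularization strength (the default)
)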
regression_cd_algorithm
regression_cd_algorithm controls the coordinate-descent algorithm used to fit the linear regression equations in each leaf. The default value is :covariance, but in scenarios where the number of features is small you might achieve faster training by setting this parameter to :naive.
regression_weighted_betas
regression_weighted_betas specifies the regularization scheme to use for the linear regression equations. With the default value of false, the regularization penalty is simply added across each leaf in the tree, meaning that smaller leaves are more likely to have simpler regression equations. When set to true, the regularization penalty is weighted by the number of points in each leaf, so that the complexity of the linear regressions is not restricted by the size of the leaf. For more information, refer to the discussion on regularization schemes.
It is recommended to set this parameter to true in prescription problems with linear regression in the leaves, so that the tree is not penalized for any additional splitting that is needed to refine prescriptions. In other applications, the default value is usually fine.
regression_intercept
regression_intercept is a Bool that specifies whether constant intercept terms should be included in the linear regression equations in each leaf. The default value is true. If set to false, both normalize_X and normalize_y must also be set to false.
Survival Learners
The OptimalTreeSurvivalLearner is used for training Optimal Survival Trees. The following values for criterion are permitted:
In addition to the shared parameters, these learners also support the shared survival parameters as well as the following parameters.
death_minbucket
death_minbucket is similar to minbucket, except it specifies the number of non-censored observations that are required in each leaf. The default value is 1. As with minbucket, we recommend raising this value only if you notice undesirable behavior in the fitted trees.
skip_curve_fitting
skip_curve_fitting is a Bool that controls whether to skip fitting survival curves during the tree training process, which can reduce memory usage. The default value is false.
If curve fitting is skipped, any API functions that use the fitted survival curves will become inaccessible until the learner is updated to contain survival curves, which can be done with refit_leaves!:
IAI.refit_leaves!(lnr, X, deaths, times, criterion=:integratedbrier)
Prescription Learners
OptimalTreePrescriptionMinimizer and OptimalTreePrescriptionMaximizer are used to train Optimal Prescriptive Trees. The learner you should select depends on whether the goal of your prescriptive problem is to minimize or maximize outcomes. The following values for criterion are permitted:
These learners support all parameters of OptimalTreeRegressor, the shared prescription parameters, and the following parameters.
treatment_minbucket
treatment_minbucket is similar to minbucket, except it specifies the number of points with a given treatment that must be in a leaf in order for that leaf to prescribe this treatment. For instance, if treatment_minbucket is 10, then there must be at least 10 points with treatment A in a leaf before the leaf is allowed to consider treatment A for prescription.
The default value is 1. As with minbucket, we recommend raising this value only if you notice undesirable behavior in the fitted trees.
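For example, a hedged sketch requiring at least 10 points of a treatment in a leaf before that treatment can be prescribed there:

lnr = IAI.OptimalTreePrescriptionMaximizer(
    max_depth=3,
    treatment_minbucket=10,  # a treatment needs 10 points in a leaf to be prescribed
)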
Policy Learners
OptimalTreePolicyMinimizer and OptimalTreePolicyMaximizer are used to train Optimal Policy Trees. The learner you should select depends on whether the goal of your prescriptive problem is to minimize or maximize outcomes. The following values for criterion are permitted:
- :reward (default)
In addition to the shared parameters, these learners also support the shared policy parameters.
Multi-Task Learners
OptimalTreeMultiClassifier and OptimalTreeMultiRegressor are multi-task versions of OptimalTreeClassifier and OptimalTreeRegressor, respectively. They can be used to train trees that predict multiple classification or regression targets simultaneously.
These learners support all parameters of their respective single-task learners, as well as the shared multi-task parameters.