Advanced
Parallelization
Many learners support training in parallel across multiple threads, processes or even machines using the parallelism built into Julia.
In order to parallelize the training process, you will need to choose whether to parallelize over multiple threads or over multiple processes; these two approaches cannot be used simultaneously. We recommend using multiple threads for best performance, as the overhead is significantly lower than for multiple processes.
Parallelization with multiple threads
You can start Julia with multiple threads in two ways:
- Specify the number of threads using the environment variable `JULIA_NUM_THREADS` (i.e. setting `JULIA_NUM_THREADS=8` will use 8 threads)
- Julia 1.5+ only: specify the number of threads using the `-t`/`--threads` flag when starting Julia (i.e. starting Julia with `julia -t 8` will use 8 threads)
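Once Julia has started, you can confirm how many threads are available using the built-in `Threads.nthreads()` function:

```julia
# Number of threads available in the current Julia session
# (e.g. 8 if Julia was started with JULIA_NUM_THREADS=8 or julia -t 8)
Threads.nthreads()
```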
If using Jupyter, you can create a separate kernel that uses a specific number of threads:
```julia
using IJulia
installkernel("Julia (IAI, 8 threads)", "--sysimage=path/to/sys.dylib",
              env=Dict("JULIA_NUM_THREADS" => "8"))
```
You can only set the number of Julia threads when starting Julia; once Julia is running, it is not possible to increase the number of threads.
The parallelism of the learner fitting algorithm can be controlled via the `num_threads` parameter on the learner (see Parameters). This parameter is an `Integer` specifying the number of threads to use. All threads will be used by default, but a smaller number can be specified if desired.
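For example, here is a minimal sketch that limits training to four threads, assuming `num_threads` is set like any other learner parameter (`OptimalTreeClassifier` and the other values are purely illustrative):

```julia
# Sketch: restrict training to 4 threads instead of all available threads
lnr = IAI.OptimalTreeClassifier(
    max_depth=5,      # illustrative learner parameter
    num_threads=4,    # use at most 4 threads when fitting
)
# IAI.fit!(lnr, X, y) would then train using at most 4 threads
```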
If using the R or Python interface, you will need to specify the number of threads by setting the `JULIA_NUM_THREADS` environment variable before initializing the IAI interface:
For R:

```r
Sys.setenv(JULIA_NUM_THREADS=8)
iai::iai_setup()
```
For Python:

```python
import os
os.environ['JULIA_NUM_THREADS'] = '8'
from interpretableai import iai
```
Parallelization with multiple processes
You can start Julia with extra worker processes using the `-p`/`--procs` flag when running Julia from the terminal. For example, the following shell command will start Julia with three additional processes, for a total of four:

```bash
$ julia -p 3
```
You can also add additional worker processes to an existing Julia session using the `addprocs` function. The following Julia code adds three additional processes for a total of four:

```julia
using Distributed
addprocs(3)
```
The parallelism of the learner fitting algorithm can be controlled via the `parallel_processes` parameter on the learner (see Parameters). There are two options for specifying this parameter (see the sketch after this list):

- `nothing` will use all available processes during training
- Specify a `Vector` containing the IDs of the processes to use during training. This needs to be a subset of the available processes, which can be found by running `Distributed.procs()`.
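As a hedged sketch, assuming `parallel_processes` is set like any other learner parameter (`OptimalTreeClassifier` is just an illustrative choice of learner):

```julia
using Distributed
addprocs(3)            # add three workers; their IDs are typically 2, 3 and 4
Distributed.procs()    # list all available process IDs, e.g. [1, 2, 3, 4]

# Use only workers 2 and 3 during training; passing nothing (the default)
# would use all available processes instead
lnr = IAI.OptimalTreeClassifier(parallel_processes=[2, 3])
```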
If using the R or Python interface, you can add additional Julia processes using the `iai::add_julia_processes` (R) or `iai.add_julia_processes` (Python) functions.
Rich Multimedia Output Control
There are many learners and other objects that take advantage of Julia's rich multimedia output to produce interactive browser visualizations in Jupyter notebooks. Because these displays happen automatically, there is no opportunity to pass any desired keyword arguments to the display functions. If you would like to customize these visualizations with keyword arguments, you can instead use the `set_rich_output_param!` function to specify each argument, which will then be passed to the display function when it is called automatically.
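For instance, a minimal sketch, assuming `set_rich_output_param!` takes the parameter name and its value as arguments (the keyword shown is made up for illustration; consult the documentation of the relevant visualization for the keywords it actually supports):

```julia
# Hypothetical sketch: store a keyword argument so that it is passed to the
# display function whenever a visualization is shown automatically.
# :show_node_id is an illustrative, made-up keyword.
IAI.set_rich_output_param!(:show_node_id, false)
```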
Training Checkpoints
Learners can be configured to save periodic "checkpoints" during training, which store the current state of the training process. It is then possible to resume training from a checkpoint file, which can be useful if the original training process was interrupted for any reason.
Checkpointing is enabled by setting any or all of the following learner parameters:

- `checkpoint_file` specifies the base name used when creating the checkpoint files. The default value is `"checkpoint"`, meaning that the checkpoint files are named `"checkpoint.json"` and similar.
- `checkpoint_dir` specifies the directory in which to save the checkpoint files, and defaults to the current directory.
If either of these parameters is specified, checkpointing will be enabled when the learner is trained. If the learner is being trained in a `GridSearch`, checkpointing will be enabled for the entire grid search process.
Every time a checkpoint is created, a separate checkpoint file is saved with a timestamp appended to the filename. This means that all checkpoints from the training are available for later use. If you do not wish to retain all checkpoint files, you can use the `checkpoint_max_files` learner parameter to specify the maximum number of files to keep. For instance, if `checkpoint_max_files=3`, then only the three most recent checkpoint files will be kept.
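As an illustrative sketch, assuming the checkpoint parameters are set like any other learner parameters (`OptimalTreeClassifier` and the values shown are just examples):

```julia
# Sketch: save checkpoint files based on the name "my_run" into ./checkpoints/,
# keeping only the three most recent timestamped files
lnr = IAI.OptimalTreeClassifier(
    checkpoint_file="my_run",
    checkpoint_dir="checkpoints",
    checkpoint_max_files=3,
)
# IAI.fit!(lnr, X, y) would then write checkpoints periodically during training
```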
In addition to the timestamped checkpoint files, a non-timestamped checkpoint file is created that always contains the most recent checkpoint saved during training.
To resume training from a checkpoint file, you can pass the path to this file to `resume_from_checkpoint`. This will continue the training process from the state saved in the checkpoint file, and return the trained learner or grid search.
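For example, a minimal sketch (the path shown is hypothetical and should be replaced with the path to an actual checkpoint file from a previous run):

```julia
# Resume an interrupted training run from the state saved in a checkpoint file
lnr = IAI.resume_from_checkpoint("checkpoints/my_run.json")
```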