# Advanced

## Parallelization

Many learners support training in parallel across multiple threads, processes, or even machines using the parallelism built into Julia.

In order to parallelize the training process, you will need to choose between parallelizing over multiple threads or over multiple processes; these two approaches cannot be used simultaneously.

**Tip**

We recommend using multiple threads for best performance, as the overhead is significantly lower than for multiple processes.

### Parallelization with multiple threads

You can start Julia with multiple threads in two ways:

• specify the number of threads using the `JULIA_NUM_THREADS` environment variable (e.g. setting `JULIA_NUM_THREADS=8` will use 8 threads)
• Julia 1.5+ only: specify the number of threads using the `-t`/`--threads` flag when starting Julia (e.g. starting Julia with `julia -t 8` will use 8 threads)
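Once Julia is running, you can confirm how many threads are available using the standard library (no IAI functionality required):

```julia
# Report the number of threads available to this Julia session
println(Threads.nthreads())
```

If this prints `1`, the environment variable or flag was not picked up and training will not be multithreaded.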

If using Jupyter, you can create a separate kernel that uses a specific number of threads:

```julia
using IJulia
installkernel("Julia (IAI, 8 threads)", "--sysimage=path/to/sys.dylib",
              env=Dict("JULIA_NUM_THREADS" => "8"))
```
**Warning**

You can only set the number of Julia threads when starting Julia - once Julia is running, it is not possible to increase the number of threads.

The parallelism of the learner fitting algorithm can be controlled via the `num_threads` parameter on the learner (see Parameters). This parameter is an `Integer` specifying the number of threads to use. By default, all available threads are used, but a smaller number can be specified if desired.
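As an illustration (a sketch assuming the `OptimalTreeClassifier` learner type; any learner accepting shared parameters works the same way):

```julia
using IAI

# Restrict training to 4 threads rather than all available threads
lnr = IAI.OptimalTreeClassifier(num_threads=4)

# The parameter can also be changed after construction
IAI.set_params!(lnr, num_threads=2)
```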

If using the R or Python interface, you will need to specify the number of threads by setting the JULIA_NUM_THREADS environment variable before initializing the IAI interface:

• For R:

```r
Sys.setenv(JULIA_NUM_THREADS = 8)
iai::iai_setup()
```

• For Python:

```python
import os
os.environ['JULIA_NUM_THREADS'] = '8'
from interpretableai import iai
```

### Parallelization with multiple processes

You can start Julia with extra worker processes using the `-p`/`--procs` flag when running Julia from the terminal. For example, the following shell command will start Julia with three additional processes, for a total of four:

```bash
$ julia -p 3
```

You can also add additional worker processes to an existing Julia session using the addprocs function. The following Julia code adds three additional processes for a total of four:

```julia
using Distributed
addprocs(3)
```

The parallelism of the learner fitting algorithm can be controlled via the parallel_processes parameter on the learner (see Parameters). There are two options for specifying this parameter:

• `nothing` will use all available processes during training
• a `Vector` containing the IDs of the processes to use during training. This must be a subset of the available processes, which can be found by running `Distributed.procs()`.
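To illustrate (again a sketch assuming the `OptimalTreeClassifier` learner type, which here stands in for any learner):

```julia
using Distributed
addprocs(3)  # worker IDs will typically be 2, 3 and 4

using IAI

# Train using only workers 2 and 3; the IDs must be a subset of
# those returned by Distributed.procs()
lnr = IAI.OptimalTreeClassifier(parallel_processes=[2, 3])
```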

If using the R or Python interface, you can add additional Julia processes using the `iai::add_julia_processes` (R) or `iai.add_julia_processes` (Python) functions.

## Rich Multimedia Output Control

There are many learners and other objects that take advantage of Julia's rich multimedia output to produce interactive browser visualizations in Jupyter notebooks. Because these displays happen automatically, there is no opportunity to pass any desired keyword arguments to the display functions. If you would like to customize these visualizations with keyword arguments, you can instead use `set_rich_output_param!` to specify the argument, which will then be passed to the display function when it is automatically called.
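A sketch of how this might look (the keyword name `show_node_id` is purely illustrative, not a documented display argument):

```julia
using IAI

# Every subsequent automatic rich display will receive
# the keyword argument show_node_id=false
IAI.set_rich_output_param!(:show_node_id, false)
```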

## Training Checkpoints

Learners can be configured to save periodic "checkpoints" during training, which store the current state of the training process. It is then possible to resume training from a checkpoint file, which can be useful if the original training process was interrupted for any reason.

Checkpointing is enabled by setting any or all of the following learner parameters:

• `checkpoint_file` specifies the base name used when creating the checkpoint files. The default value is `"checkpoint"`, meaning that the checkpoint files are named `"checkpoint.json"` and similar.
• `checkpoint_dir` specifies the directory in which to save the checkpoint files, and defaults to the current directory.

If either of these parameters is specified, checkpointing will be enabled when the learner is trained. If the learner is being trained in a GridSearch, checkpointing will be enabled for the entire grid search process.

Every time a checkpoint is created, a separate checkpoint file is saved with a timestamp appended to the filename. This means that all checkpoints from the training are available for later use. If you do not wish to retain all checkpoint files, you can use the `checkpoint_max_files` learner parameter to specify the maximum number of files to keep. For instance, if `checkpoint_max_files=3`, then only the three most recent checkpoint files will be kept.

In addition to the timestamped checkpoint files, a non-timestamped checkpoint file is created that always contains the most recent checkpoint saved during training.
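Putting these parameters together, an illustrative sketch (learner type and file names assumed, not prescribed):

```julia
using IAI

lnr = IAI.OptimalTreeClassifier(
    checkpoint_file="my_run",       # files named "my_run.json" and similar
    checkpoint_dir="checkpoints",   # saved under ./checkpoints
    checkpoint_max_files=3,         # keep only the 3 most recent checkpoints
)
```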

To resume training from a checkpoint file, you can pass the path to this file to `resume_from_checkpoint`. This will continue the training process from the state saved in the checkpoint file, and return the trained learner or grid search.
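For example, a sketch assuming the checkpoint path used above (the path is illustrative):

```julia
using IAI

# Continue training from the most recent (non-timestamped) checkpoint,
# returning the trained learner or grid search
lnr = IAI.resume_from_checkpoint("checkpoints/my_run.json")
```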