Turbofan: an interpretable predictive maintenance case

Predictive maintenance aims to find the right moment to perform maintenance, so that an industrial system's components are not replaced prematurely while the reliability of the whole system is ensured. As the complexity of industrial systems grows, understanding the ways in which they can fail becomes all the more challenging.

In this case study, we build models to quantify the risk of failure for a machine at any moment, and we illustrate how interpretability helps us understand the underlying failure mechanisms through two approaches:

  1. Predict failure within a given time window. We compare Optimal Classification Trees with CART and XGBoost in terms of interpretability and performance.
  2. Predict the probability of failure over time with our novel Optimal Survival Tree algorithm.

We conduct our analysis on the turbofan dataset, released by NASA to study engine degradation.

Turbofan Datasets

The turbofan datasets present the following characteristics:

  • Each dataset consists of multiple multivariate time series, generated by a physical model of the turbofan.
  • Each dataset simulates the behavior of a turbofan in different atmospheric conditions and with different fault modes.
  • Each time series is from a different engine, i.e., the data can be considered to be from a fleet of engines of the same type.
  • Each engine starts with different degrees of initial wear and manufacturing variation which is unknown to the user. This wear and variation is considered normal, i.e., it is not considered a fault condition.
  • There are three operational settings that have a substantial effect on engine performance. These settings are also included in the data.
  • The engine is operating normally at the start of each time series, and develops a fault at some point during the series.
  • The data is contaminated with sensor noise.

Data Reading

We load the dataset 'FD_003':

  • It records 20k+ daily observations for 100 failure trajectories
  • Each engine is evaluated in the same atmospheric conditions
  • There are two causes of failure: High Pressure Compressor (HPC) degradation and fan degradation
import pandas as pd

name_df = pd.DataFrame(
    [
        ['sensor_01', 'T2', 'Total temperature at fan inlet'],
        ['sensor_02', 'T24', 'Total temperature at LPC outlet'],
        ['sensor_03', 'T30', 'Total temperature at HPC outlet'],
        ['sensor_04', 'T50', 'Total temperature at LPT outlet'],
        ['sensor_05', 'P2', 'Pressure at fan inlet'],
        ['sensor_06', 'P15', 'Total pressure in bypass-duct'],
        ['sensor_07', 'P30', 'Total pressure at HPC outlet'],
        ['sensor_08', 'Nf', 'Physical fan speed'],
        ['sensor_09', 'Nc', 'Physical core speed'],
        ['sensor_10', 'epr', 'Engine pressure ratio (P50/P2)'],
        ['sensor_11', 'Ps30', 'Static pressure at HPC outlet'],
        ['sensor_12', 'phi', 'Ratio of fuel flow to Ps30'],
        ['sensor_13', 'NRf', 'Corrected fan speed'],
        ['sensor_14', 'NRc', 'Corrected core speed'],
        ['sensor_15', 'BPR', 'Bypass Ratio'],
        ['sensor_16', 'farB', 'Burner fuel-air ratio'],
        ['sensor_17', 'htBleed', 'Bleed Enthalpy'],
        ['sensor_18', 'Nf_dmd', 'Demanded fan speed'],
        ['sensor_19', 'PCNfR_dmd', 'Demanded corrected fan speed'],
        ['sensor_20', 'W31', 'HPT coolant bleed'],
        ['sensor_21', 'W32', 'LPT coolant bleed'],
        ['sensor_22', 'T48', 'Total temperature at HPT outlet'],
        ['sensor_23', 'SmFan', 'Fan stall margin'],
        ['sensor_24', 'SmLPC', 'LPC stall margin'],
        ['sensor_25', 'SmHPC', 'HPC stall margin'],
    ],
    columns=['sensor_id', 'sensor_name', 'sensor_description'],
)
name_df.set_index('sensor_id', inplace=True)

names = ['unit_nr', 'time', 'operation_setting_1', 'operation_setting_2',
         'operation_setting_3']
names += name_df.sensor_name.to_list()

df = pd.read_csv('train_FD003.txt', sep=' ', header=None, names=names,
                 index_col=False)  # names with no matching column become all-NaN columns, dropped below

Data Cleaning

We remove the columns that contain no values, as well as the near-constant columns (at most two distinct values):

df.dropna(how='all', axis=1, inplace=True)
cols_const = [col for col in df.columns if len(df[col].unique()) <= 2]
df.drop(columns=cols_const, inplace=True)
       unit_nr  time  operation_setting_1  ...  htBleed    W31      W32
0            1     1              -0.0005  ...      391  39.11  23.3537
1            1     2               0.0008  ...      392  38.99  23.4491
2            1     3              -0.0014  ...      391  38.85  23.3669
3            1     4              -0.0020  ...      392  38.96  23.2951
4            1     5               0.0016  ...      392  39.14  23.4583
5            1     6               0.0011  ...      393  38.92  23.4281
6            1     7              -0.0038  ...      391  38.84  23.4087
...        ...   ...                  ...  ...      ...    ...      ...
24713      100   146               0.0016  ...      396  38.41  23.1246
24714      100   147               0.0004  ...      395  38.59  23.0528
24715      100   148              -0.0016  ...      394  38.44  22.9631
24716      100   149               0.0034  ...      395  38.50  22.9746
24717      100   150              -0.0016  ...      396  38.39  23.0682
24718      100   151              -0.0023  ...      395  38.31  23.0753
24719      100   152               0.0000  ...      396  38.56  23.0847

[24720 rows x 20 columns]

Add remaining useful life (RUL) column

For each unit, we compute the time remaining before the engine failure occurred.

mapper = {}
for unit_nr in df['unit_nr'].unique():
    mapper[unit_nr] = df['time'].loc[df['unit_nr'] == unit_nr].max()

df['RUL'] = df['unit_nr'].apply(lambda nr: mapper[nr]) - df['time'] + 1
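The same countdown can be checked on a toy frame (values are illustrative); `groupby`/`transform` gives an equivalent one-line formulation:

```python
import pandas as pd

# Two toy engines: unit 1 runs 3 cycles, unit 2 runs 2 cycles.
toy = pd.DataFrame({'unit_nr': [1, 1, 1, 2, 2],
                    'time':    [1, 2, 3, 1, 2]})

# Equivalent to the mapper above: RUL counts down to 1 at the last cycle.
toy['RUL'] = toy.groupby('unit_nr')['time'].transform('max') - toy['time'] + 1
print(toy['RUL'].tolist())  # [3, 2, 1, 2, 1]
```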

Data Visualization

We can plot the 18 remaining signals (Y-axis) over time (X-axis) for a given engine.

import matplotlib.pyplot as plt
df_unit = df[df.unit_nr == 1]
fig = plt.figure(figsize=(20, 25))
for i, col in enumerate(df_unit.columns[2:-1]):
    ax = fig.add_subplot(6, 3, i+1)
    df_unit.plot.scatter('time', col, title=col, ax=ax)

We observe that, as the engine wears, most sensors record a monotonic trend (once noise is accounted for). This illustrates that the turbofan's modules influence each other's behavior and lifetime. The goal of our analysis is to understand which behaviors forewarn of failure and how faults propagate through the entire system.
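Such monotonic drift can be quantified, for example, with the Spearman rank correlation between a signal and time; the sketch below uses a synthetic degradation signal rather than the real sensor data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(200)
# Synthetic signal: slow upward drift buried in sensor noise.
signal = 0.02 * t + rng.normal(scale=0.5, size=t.size)

# Spearman correlation is close to +/-1 for a monotonic trend,
# and robust to the nonlinearity of the drift.
rho = pd.Series(signal).corr(pd.Series(t, dtype=float), method='spearman')
print(round(rho, 2))
```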

Predict Next Day Failures

We transform our predictive maintenance problem into a classification task. Our goal is to predict whether the machine will fail on the following day, given the measurements available up to that day.

Data Preprocessing

In real-life scenarios, it is rare to have this many data points. We keep 1500 normal measurements and 100 failures, which reduces the number of observations while preserving the class imbalance.

df_error = df[df.RUL == 2].reset_index(drop=True)
df_no_error = df[df.RUL > 2].reset_index(drop=True)
df_no_error = df_no_error.sample(n=1500, random_state=0).reset_index(drop=True)

df_sampled = pd.concat([df_error, df_no_error]).reset_index(drop=True)
df_sampled = df_sampled.sample(frac=1, random_state=0).reset_index(drop=True)
df_sampled['error'] = df_sampled['RUL'] == 2

df_class = df_sampled.drop(columns=['unit_nr', 'time', 'RUL'])
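As a sanity check, the resulting failure rate can be computed directly; the sketch below uses a synthetic stand-in for the sampled labels:

```python
import pandas as pd

# 100 failures among 1600 observations, mirroring the sampling above.
labels = pd.Series([True] * 100 + [False] * 1500, name='error')
failure_rate = labels.mean()
print(failure_rate)  # 0.0625
```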

We randomly split our data into training and testing. All models will be trained on the same training set and evaluated on the same testing set.

from interpretableai import iai
X = df_class.iloc[:, :-1]
y = df_class.iloc[:, -1]
(train_X, train_y), (test_X, test_y) = iai.split_data('classification', X, y,
                                                      seed=1)  # seed fixed for reproducibility
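If the interpretableai package is not available, a comparable stratified 70/30 split can be sketched with scikit-learn (the data here is a synthetic stand-in for the sensor matrix and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1600, 17))       # stand-in for the 17 feature columns
y = np.array([1] * 100 + [0] * 1500)  # stand-in for the error labels

# stratify=y preserves the failure rate in both splits.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
print(train_X.shape, round(train_y.mean(), 4), round(test_y.mean(), 4))
```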


CART

We first run the classical Classification and Regression Tree (CART) method. Because decision trees are easy to read, CART is often the first method considered when interpretability matters.

grid = iai.GridSearch(
    iai.OptimalTreeClassifier(
        random_seed=1,
        localsearch=False,  # assumption: greedy splits only, to emulate CART
    ),
    max_depth=range(1, 10),
)
grid.fit(train_X, train_y)

[Optimal Trees visualization]

train_auc = grid.score(train_X, train_y, criterion='auc')
test_auc = grid.score(test_X, test_y, criterion='auc')
{'train_auc': train_auc, 'test_auc': test_auc}
{'train_auc': 0.9927891156462585, 'test_auc': 0.7891111111111112}

The AUC on the testing set is reasonable, but despite validating the depth and complexity parameters, the near-perfect training AUC indicates that the model overfits the data.

Consequently, most of the relationships between sensor measurements picked up by the model are not interpretable patterns but artifacts of the training data and its noise. This model can therefore not be relied upon to understand the interactions between the system's components, and it is likely to produce false positives.
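The train/test gap seen here can be reproduced in miniature on synthetic data; the sketch below uses scikit-learn's greedy decision tree, with all data and parameters chosen purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
# The label depends weakly on one feature; the other nine are pure noise.
y = (X[:, 0] + rng.normal(scale=2.0, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A deep greedy tree memorizes the training noise.
clf = DecisionTreeClassifier(max_depth=9, random_state=0).fit(X_train, y_train)
train_auc = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(round(train_auc, 3), round(test_auc, 3))
```

A large gap between the two AUCs is the same overfitting signature observed above.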


XGBoost

We then run XGBoost, widely praised for its accuracy on predictive tasks.

from xgboost import XGBClassifier, plot_importance
from sklearn.metrics import roc_auc_score

clf = XGBClassifier()
clf.fit(train_X, train_y)

y_pred = clf.predict_proba(test_X)[:, 1]
train_auc = roc_auc_score(train_y, clf.predict_proba(train_X)[:, 1])
test_auc = roc_auc_score(test_y, y_pred)
{'train_auc': train_auc, 'test_auc': test_auc}
{'train_auc': 0.9999999999999999, 'test_auc': 0.988}

We can plot the importance of every feature in the model:

plot_importance(clf, max_num_features=15)

The results are excellent in terms of AUC, but it is challenging to understand the logic underlying the algorithm's decisions beyond which variables are deemed important.

For more details on how to explain XGBoost models and how they compare with Optimal Trees, please refer to the Loan Default Risk case study.

Optimal Trees

We finally train an Optimal Classification Tree, validating over the tree depth:

grid = iai.GridSearch(
    iai.OptimalTreeClassifier(
        random_seed=1,
    ),
    max_depth=range(2, 4),
)
grid.fit_cv(train_X, train_y)

[Optimal Trees visualization]