An Explainable XGBoost Approach to Predicting Car Prices¶

Overview¶

For a while I wanted to create a project where I dived into the explainability side of models a bit more, and I finally found the time to do it. Since tree-based models (be them sequentially boosted or ensembles) are a powerful and popular approach to both classification and regression tasks - while also being more explainable models than, say, neural networks - they are a great choice for developing models where discovering insights and explaining them to non-technical stakeholders is paramount (basically in any business situation). There is a rather extensive suite of tools at our disposal to dive into the "explainable ML" side of things, and understandably I will not be able to cover them all, but I will do my best to be thorough enough to give you a good overview of what we can achieve with these models and tools.

For this project, I leveraged the publicly available Kaggle Used Cars Dataset by Austin Reese. I modified it slightly (I dropped some columns I considered were not going to give me useful information for this project to make the dataset lighter), but it preserves the essence of the original. I will modify it further as part of the EDA/preprocessing steps described below, anyway.

I decided to use this dataset to create a model that could predict the price of a car utilising the features that are usually made available when putting it up for sale when used, and therefore to give the user an estimation of whether the sale price is fair compared to what the model thinks should be the price. Why? Because I find it very interesting and useful on many different levels, and a model such as this one could be argued to be a good companion for someone without extensive knowledge of the used car market to have a confident approximation on what a car should be valued at based on its brand, model, mileage, and more.

Since the purpose and scope of this project is to dive into the explainable ML side of things, I will not spend time diving into the EDA/preprocessing steps, but I provide the code and outputs for transparency purposes. I will also use a predefined XGBRegressor() model I obtained after spending some days running grid searches with cross-validations and comparing performance markers. The model is not perfect (experienced data scientist will quickly catch its pitfalls, and I'm sure they'll want to complain at what I did), but the point of this exercise is to delve into the explainable side of a given model, not to get the perfect model with the most straight-laced textbook approach.

As usual, if you find this useful, or find something that you disagree with (or, better, if you develop a better model with this approach), I would love to hear from you!

I hope you enjoy reading this as much as I did developing it =)

Developed on Python 3.10.6. Packages and versions are: Pandas 1.4.4; Numpy 1.26.4; Matplotlib 3.5.3; Seaborn 0.11.2; Shap 0.46.0; XGBoost 2.1.1; Sklearn 1.2.0; PPScore 1.3.0

Quick EDA + Preprocessing¶

Import Libraries and Data¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import time 
warnings.filterwarnings('ignore')
import shap
shap.initjs()
from scipy.stats import zscore

sns.set_theme(
    context='talk',
    font_scale=0.7,
    palette = ['#0F00F5', '#3061FF', '#9AB1FF', '#CDD9FF', '#E6ECFF','#E5E5E5',
               '#B6BBCB', '#878B98','#696A6F','#292A2E'],
    style = {
        'axes.facecolor': '#FFFFFF',
         'axes.edgecolor': '#000000',
         'axes.legend.edgecolor':'#FFFFFF',
         'axes.grid': False,
         'axes.axisbelow': 'line',
         'axes.labelcolor': 'black',
         'figure.facecolor': '#FFFFFF',
         'grid.color': '#b0b0b0',
         'grid.linestyle': '-',
         'text.color': 'black',
         'xtick.color': 'black',
         'ytick.color': 'black',
         'xtick.direction': 'out',
         'ytick.direction': 'out',
         'patch.edgecolor': '#FFFFFF',
         'patch.force_edgecolor': True,
         'image.cmap': 'viridis',
         'font.family': ['sans-serif'],
         'font.sans-serif': 'Helvetica Neue',
         'xtick.bottom': False,
         'xtick.top': False,
         'ytick.left': False,
         'ytick.right': False,
         'axes.spines.left': False,
         'axes.spines.bottom': False,
         'axes.spines.right': False,
         'axes.spines.top': False
    }
)
In [38]:
cars_df_full = pd.read_pickle('data/cars_data.pkl')
cars_df_full.head()
Out[38]:
region price year manufacturer model condition cylinders fuel odometer title_status transmission drive size type paint_color county state lat long posting_date
0 prescott 6000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN az NaN NaN NaN
1 fayetteville 11900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ar NaN NaN NaN
2 florida keys 21000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN fl NaN NaN NaN
3 worcester / central MA 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ma NaN NaN NaN
4 greensboro 4900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN nc NaN NaN NaN

Initial Clean-up¶

In [39]:
#drop some missing data and columns we won't use
cars_df = cars_df_full.dropna(how='any', subset=['price', 'year', 'manufacturer', 'model', 'odometer']).reset_index(drop=True).drop(columns=['lat', 'long', 'posting_date', 'county', 'size', 'paint_color', 'region'])

#discard cars w/o price
cars_df = cars_df[cars_df['price'] > 0]

#remove price outliers (top 10 and bottom 10 deciles)
cars_df['decile'] = pd.qcut(cars_df['price'], 10, labels=range(1,11))
cars_df = cars_df[cars_df['decile'].isin([1,10])==False]
cars_df = cars_df.drop(columns='decile')

#remove odometer 1% outliers
cars_df['odometer_centile'] = pd.qcut(cars_df['odometer'], 100, labels=range(1,101))
cars_df = cars_df[cars_df['odometer_centile'] <= 99].drop(columns='odometer_centile')

#remove models that appear less than 5 times
counts_df = cars_df['model'].value_counts().reset_index().loc[:,['index', 'model']]
model_counts_dict = dict(zip(counts_df['index'], counts_df['model']))
cars_df['model_count'] = cars_df['model'].map(model_counts_dict)
cars_df = cars_df[cars_df['model_count'] >= 5]
In [40]:
#Check % of missingness in remaining df
round(cars_df.isna().sum() * 100 / len(cars_df),2)
Out[40]:
price            0.00
year             0.00
manufacturer     0.00
model            0.00
condition       37.08
cylinders       41.70
fuel             0.31
odometer         0.00
title_status     1.32
transmission     0.42
drive           30.06
type            19.74
state            0.00
model_count      0.00
dtype: float64

Imputation + Basic Encoding¶

In [41]:
cylinders_df = cars_df[['model', 'cylinders']]
cylinders_df.dropna(inplace=True)
cylinders_df.drop_duplicates()
cylinders_dict = dict(zip(cylinders_df['model'], cylinders_df['cylinders']))
cars_df['cylinders'] = cars_df['model'].map(cylinders_dict)
cars_df['cylinders'] = cars_df['cylinders'].str.extract('(\d{1,2})')
cars_df['cylinders'] = pd.to_numeric(cars_df['cylinders'], downcast='signed')

drive_df = cars_df[['model', 'drive']]
drive_df.dropna(inplace=True)
drive_df.drop_duplicates()
drive_dict = dict(zip(drive_df['model'], drive_df['drive']))
cars_df['drive'] = cars_df['model'].map(drive_dict)

type_df = cars_df[['model', 'type']]
type_df.dropna(inplace=True)
type_df.drop_duplicates()
type_dict = dict(zip(type_df['model'], type_df['type']))
cars_df['type'] = cars_df['model'].map(type_dict)

fuel_dict = dict(zip(cars_df.groupby('fuel').mean().reset_index()['fuel'].tolist(), list(reversed(range(5)))))
cars_df['fuel'] = cars_df['fuel'].map(fuel_dict)

cars_df['title_status'] = cars_df['title_status'].fillna('missing')

title_dict = {
    'clean': 5,
    'lien': 4,
    'rebuilt': 3,
    'salvage': 2,
    'missing': 1,
    'parts only': 0 
}

cars_df['title_status'] = cars_df['title_status'].map(title_dict)


condition_dict = {
    'new': 5,
    'like new': 4,
    'excellent': 3,
    'good': 2,
    'fair': 1,
    'salvage': 0
}
cars_df['condition'] = cars_df['condition'].map(condition_dict)
cars_df['condition'] = cars_df['condition'].fillna(0)
In [42]:
#Check missingness after imputation
round(cars_df.isna().sum() * 100 / len(cars_df),2)
Out[42]:
price           0.00
year            0.00
manufacturer    0.00
model           0.00
condition       0.00
cylinders       8.33
fuel            0.31
odometer        0.00
title_status    0.00
transmission    0.42
drive           3.87
type            0.72
state           0.00
model_count     0.00
dtype: float64
In [43]:
#drop duplicates across the board
cars_df.drop_duplicates(inplace=True)

cars_df.reset_index(drop=True, inplace=True)

cars_df.head()
Out[43]:
price year manufacturer model condition cylinders fuel odometer title_status transmission drive type state model_count
0 33590 2014.0 gmc sierra 1500 crew cab slt 2.0 8.0 2.0 57923.0 5 other 4wd pickup al 185
1 22590 2010.0 chevrolet silverado 1500 2.0 8.0 2.0 71229.0 5 other 4wd pickup al 3574
2 30990 2017.0 toyota tundra double cab sr 2.0 8.0 2.0 41124.0 5 other 4wd pickup al 328
3 15000 2013.0 ford f-150 xlt 3.0 8.0 2.0 128000.0 5 automatic 4wd pickup al 325
4 27990 2012.0 gmc sierra 2500 hd extended cab 2.0 8.0 2.0 68696.0 5 other 4wd pickup al 165

Check Distributions¶

In [140]:
fig, axes = plt.subplots(3,5, figsize=(20,15))
sample_data = cars_df.sample(frac=.1, random_state=0)

#replace these vars with numbers to make cleaner plots and avoid overlapping labels. We just want to see distributions after all
for var in ['manufacturer', 'model', 'type', 'state']: 
    sample_data[var] = sample_data[var].map(dict(zip(sample_data[var].unique(), range(len(sample_data[var].unique())))))

for col, ax in zip(cars_df.columns, axes.ravel()):
    sns.histplot(data=sample_data[col], ax=ax).set_title(col, size=15)
    ax.set_xlabel('')
    ax.set_ylabel('')

plt.subplots_adjust(hspace=.5, wspace=.3)

Check Correlations¶

In [46]:
#Scatterplots
corr_vars = ['price', 'year', 'condition', 'cylinders', 'odometer', 'title_status']

vars_a, vars_b = [], []

for i in corr_vars:
    for j in corr_vars:
        vars_a.append(i)
        vars_b.append(j)

vars_df = pd.DataFrame({'a': vars_a, 'b': vars_b})

fig, axes = plt.subplots(len(corr_vars),len(corr_vars), figsize=(15,15))
for i, ax in zip(range(len(vars_df)), axes.ravel()):
    x = vars_df.iloc[i]['a']
    y = vars_df.iloc[i]['b']
    sns.scatterplot(data=sample_data, x=x, y=y, ax=ax, s=2, alpha=.1).set_title(f'{x}\nvs\n{y}', fontsize=10)
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.set_xticklabels('')
    ax.set_yticklabels('')
    ax.legend([],[], frameon=False)

plt.subplots_adjust(hspace=1.2, wspace=.5)
In [47]:
#Heatmaps for Pearson and Spearman coefficients
cars_corr = cars_df.loc[:,corr_vars].corr() 
cars_corr2 = cars_df.loc[:,corr_vars].corr(method='spearman') 
corr_labels = round(cars_corr, 2)
corr_labels2 = round(cars_corr2, 2)

fig, axes = plt.subplots(1,2,figsize=(18,8))
sns.heatmap(cars_corr, annot=corr_labels, cmap='vlag', center=0, ax=axes[0], cbar=False).set_title('Pearson', size=15)
sns.heatmap(cars_corr2, annot=corr_labels2, cmap='vlag', center=0, ax=axes[1], cbar=False).set_title('Spearman', size=15)

plt.subplots_adjust(wspace=.3)

Import the Model and Predict With It¶

In [48]:
from sklearn.model_selection import train_test_split

#Work with sample
cars_df = cars_df.sample(frac=.1, random_state=0).reset_index(drop=True)

#Split data
X = cars_df.drop(columns=['price', 'manufacturer', 'model', 'transmission', 'drive', 'type', 'state', 'model_count'])
y = cars_df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7, random_state=0)
In [109]:
# Load the model from the file
import joblib
xgb_model = joblib.load('best_model_final.pkl')

# Predict with the  model
y_pred = xgb_model.predict(X_test)

Evaluate the Model and Interpret the Outputs¶

Basic Performance Metrics¶

For regression tasks, there are three basic performance metrics that we are all taught to look into, namely the Mean Squared Error (MSE), the coefficient of determination (R2), and the Mean Absolute Error (MAE). There is frequent confusion between MSE and MAE because, due to their names, they appear similar, or bound to show something similar and thus people are prone to pick one over the other, or think one is better than the other. This is however not the case, as they are focused on shedding light on different things, albeit they all are calculated from the same inputs - namely the difference (errors) between the predictions of the model over test, unseen data, and the real target values corresponding to that test, unseen data.

While the main difference between the MSE and the MAE is mathematical, the observed numerical effect is largely due to how the mean behaves as a metric: it is very sensitive to extreme values. Since the MSE squares the errors, it naturally and significantly amplifies larger errors in predictions, which are for the most part associated with outliers in the data, therefore creating larger MSE values. On the other hand, the MAE does not square the errors, which gives us a better general indication of the amount of error in the predictions.

Finally, the (R2) score

represents the proportion of variance (of y) that has been explained by the independent variables in the model [and] provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance. (sklearn)

Which is as straightforward as it gets. While the rule of thumb indicates that we need to minimise MSE and MAE and maximise (R2), in practice we can have models with large MSEs and low MAEs, and the (R2) can be as good or as bad as it gets. In the case of the present model, the MSE is large while the MAE is low, and the (R2) is around 0.74.

In [110]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2= r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f'MSE:\t{mse:.1f}\nR2:\t{r2:.3f}\nMAE:\t{mae:.1f}')
MSE:	23629481.2
R2:	0.741
MAE:	3639.6

This means that:

  • the performance of the model is fair (reasonable MAE), but need to be improved: it tends to produce errors that are not too distant from the target variable. The important word here however is tends, because
  • the model contains a still important number of outliers creating significant errors (high MSE), and
  • that the features injected in the model - that is, the independent variables of which the dependent variable (car price, y) depends on, like odometer, condition, etc, all contained in X - do a good job at capturing the target (price of the car) (good R2).

This is, of course, heavily subject to personal interpretation. Notwithstanding this, the above hits home when we plot the real prices in the test set against the predicted prices for the same portion of the data:

In [111]:
fig, axes = plt.subplots(1,1,figsize=(7,5))

sns.scatterplot(x=y_test, y=y_pred, alpha=.1, s=10, ax=axes).set_title('Real vs Predicted Prices (Test Set)', size=15)

plt.xlabel('Real Prices (y_test)')
plt.ylabel('Predicted Prices (y_pred)');

Although this is one of the standard ways to visualise regression performance, it is understandably not the only way to visualise the aforementioned interpretations. Plotting the distribution of errors and the absolute value of those errors we can see from a different perspective how the model is behaving: the bulk of the errors is between 0 and 5,000 dollars in either direction, but with some important outliers north of 10,000.

In [112]:
price_diff = y_test - y_pred

price_errors = pd.DataFrame({
    'Real Price': y_test,
    'Predicted Price': y_pred,
    'Price Difference': price_diff,
    'Absolute Difference': abs(price_diff)
})

fig, axes = plt.subplots(1,2,figsize=(17,5))

sns.histplot(data=price_errors, x='Price Difference', bins=100, ax=axes[0]).set_title('Distribution of Prediction Errors (Test Set)', size=15);
sns.histplot(data=price_errors, x='Absolute Difference', bins=100, ax=axes[1]).set_title('Distribution of Prediction Absolute Errors (Test Set)', size=15);

This way we can also quickly observe where our most significant outliers are. I'll save one so we can investigate it later:

In [113]:
outlier_index = price_errors[price_errors['Absolute Difference'] == price_errors['Absolute Difference'].max()].index[0]

price_errors[price_errors['Absolute Difference'] == price_errors['Absolute Difference'].max()]
Out[113]:
Real Price Predicted Price Price Difference Absolute Difference
3426 5999 30880.935547 -24881.935547 24881.935547

Another way to look into the current state of the model in terms of performance is to calculate the frequency of the errors to see how often we are likely to make a mistake of a given magnitude when using the model to predict the price of a car. It is easier to visualise this (and to communicate it to stakeholders) if we bin the errors in groups that make sense from a business perspective:

In [141]:
price_errors['Error Bin'] = [
    '-20k+' if i <= -20000 else 
    '-20k to -10k' if i > -20000 and i <= -10000 else
    '-10k to -5k' if i > -10000 and i <= -5000 else
    '-5k to -1k' if i > -5000 and i <= -1000 else
    '-1k to -500' if i > -1000 and i <= -500 else
    '-500 to -100' if i > -500 and i <= -100 else
    '-100 to 0' if i > -500 and i <= 0 else
    '0 to 100' if i > 0 and i <= 100 else
    '100 to 500' if i > 100 and i <= 500 else
    '500 to 1k' if i > 500 and i <= 1000 else
    '1k to 5k' if i > 1000 and i <= 5000 else
    '5k to 10k' if i > 5000 and i <= 10000 else
    '10k to 20k' if i > 10000 and i <= 20000 else
    '20k+'
    for i in price_errors['Price Difference']
]

price_errors_grouped = price_errors.groupby('Error Bin').count().reset_index().iloc[:,:2]
price_errors_grouped.columns = ['Error Bin', 'Num Errors']
price_errors_grouped['Percentage'] = price_errors_grouped['Num Errors'] * 100 / price_errors_grouped['Num Errors'].sum()

fig, axes = plt.subplots(1,1,figsize=(7,5))

error_bin_order = price_errors.sort_values(by='Price Difference')['Error Bin'].unique().tolist()

sns.barplot(
    data=price_errors_grouped,
    x='Percentage',
    y='Error Bin',
    ax=axes,
    order=error_bin_order,
    color=sns.color_palette()[0]
).set_title('Frequency and Size of Errors (Test Set)', size=15);

for container in axes.containers:
    values = container.datavalues
    bar_labels = [price_errors_grouped[price_errors_grouped['Percentage']==value]['Percentage'].tolist()[0] for value in values]
    axes.bar_label(container, 
                  labels = [f'{i:.2f}%' for i in bar_labels],
                  padding=5,
                  size=11)

plt.xlim(0,45)

xlabels = axes.get_xticks()
axes.set_xticks(ticks=xlabels, labels=[f'{i:.0f}%' if i%20==0 else '' for i in xlabels]);

plt.ylabel('Prediction Error');
plt.xlabel('Frequency');

As these frequencies can intuitively be transformed into probabilities, this way of visualising the performance of the model makes it easy for non-technical stakeholders to evaluate the risks of under- or over-appreciation of a car when utilising this model as it stands. In this case, it means that, most of the time, the model will under- or over-value the car between 1,000 and 5,000 dollars. But that is a decision that needs to be made in the context of risk appetite and what is gained by the utilisation of the model being developed. It is basically a business decision.

Now, understandably we could go back and improve the model, and iterate over and over to improve our metrics and reduce those outliers, but that's not the purpose of this project, so we will move onto deriving more insights from the model as it stands.

Global Feature Importance¶

Just because the model shows good performance when using the variables (features) contained in X to predict y, it does not mean that all the variables contained in X are equally important for the model. There are multiple ways to visualise feature importance, but the most common ones are Mean Decrease in Impurity (MDI, can also be found as Gini in some implementations especially in classification tasks), Permutation, Shap*, and Predictive Power Score (PPS) - although this last one is simply a way to detect relationships between features and the target and the implementation does not require a training a model. They all have pros and cons and the interpretation depends on the model and task at hand, but it is usually useful to observe at least 2 of them to see if the same features rank high across evaluations, which is a good indication that the variable is indeed important for the model. Let us visualise the main plots and then we can explain and compare them.

*note that an alternative to Shap is LIME, but I omitted it since it does not work well out of the box with XGBoost. Also, Shap offers very similar model-agnostic tools (and much prettier visualisations) so there is no inherent advantage in leveraging both.

In [115]:
### Calculations
#MDI
mdi_importances = pd.DataFrame({
    'Feature': xgb_model.feature_names_in_, 
    'Importance': xgb_model.feature_importances_
}).sort_values(by='Importance', ascending=False).reset_index(drop=True)

# Permutation
from sklearn.inspection import permutation_importance

perm_importances = permutation_importance(
    xgb_model, X_test, y_test, n_repeats=10, random_state=0, n_jobs=-1
)
sorted_importances_idx = perm_importances.importances_mean.argsort()

permutation_importances = pd.DataFrame(
    perm_importances.importances[sorted_importances_idx].T,
    columns=X.columns[sorted_importances_idx],
)

#Shapley
# These three lines that are required to create this and some other plots later take too long to run on this model and dataset, so I'll simply load the saved results from file
explainer = shap.TreeExplainer(xgb_model) 
shap_values = explainer.shap_values(X)
explainer_obj = explainer(X)

#Predictive Power Score (PPS)
import ppscore as pps 

pps_df = X_test.copy()
pps_df['price'] = y_test
pps_scores = pps.predictors(pps_df, 'price')
In [116]:
### Show all plots
plt.subplots(2,2,figsize=(10,5))

#MDI
plt.subplot(2,2,1)
sns.barplot(data=mdi_importances, x='Importance', y='Feature', color=sns.color_palette()[0]);
plt.xlabel('MDI Importance')

#Permutation
plt.subplot(2,2,2)
sns.boxplot(data=pd.melt(permutation_importances, var_name='Feature', value_name='Importance').sort_values(by='Importance', ascending=False), 
            color=sns.color_palette()[0],
            x='Importance', 
            y='Feature', 
            orient='h',
            linewidth=.5)
plt.axvline(x=0, color="k", linestyle="--", linewidth=.8)
plt.xlabel('Permutation Importance')

#Shapley
plt.subplot(2,2,3)
shap.summary_plot(shap_values, X, plot_type="bar", show=False, plot_size=None, color=sns.color_palette()[0])
plt.ylabel('Feature');
plt.xlabel('Mean SHAP value')

#PPS
plt.subplot(2,2,4)
sns.barplot(data=pps_scores, y="x", x="ppscore", color=sns.color_palette()[0]);
plt.ylabel('Feature');
plt.xlabel('PPS')

plt.show()
  • Mean Decrease in Impurity (MDI) Importance: Although MDI is more often discussed - and makes more intuitive sense - in the context of classification, it can also be used to visualise the importance of features on regression tasks. Similar to the classification counterpart, the decrease in impurity relates to how much variance remains in the data after a split has been made by the tree when using a feature. In the case of regression, it means how much reduction in a target metric (e.g. MSE) occurs due to that split. Features with higher MDI importance are thus more important for the model to achieve the goal of minimising the target loss function.

  • Permutation Importance: This reflects the effect on the model's capacity to achieve its goal (minimise loss) when the values of a feature are randomly shuffled to observe the degradation of te model as a consequence. The higher the permutation importance (higher decrease in score) the more the model suffers from the variable not being present.

  • Shapley: Shapley values stem from cooperative game theory and were developed to determine how much a given feature contributes to an outcome (prediction), as if the features were players in a game and the prediction (in this case the price of the car) was the outcome of the match. Although Shapley values are calculated for each features x price interactions (i.e. for each prediction, or each game the players play), global, model-level values can be computed as well (such as in this visualisation). The higher the mean Shap value, the more "valuable" the feature is for the model.

  • Prediction Power Score (PPS): The PPS reflects whether there is an underlying relationship between the features and the target that might not be evident from simple scatterplots or correlation matrices, and therefore can be used to check what features have a strong relationship with the target. The PPS can be calculated directly from the original dataframe (therefore it is model-independent) and can also help reveal relationships between features, similar to the correlation matrix. Features with high PPS indicate a strong relationship with the target.

These methods have pros and cons, and the selection of the method will depend on the task, model, and objective at hand. Nonetheless, it is always useful to use at least two methods to draw conclusions, as they all have their shortcomings. In my view, the most important pros and cons can be summarised as follows:

MDI Pros:

  • Good global view of feature importance across all trees.
  • Easy to interpret, especially for classification tasks.

MDI Cons:

  • Biased towards high-cardinality features (features that have a lot of unique values in the dataset), therefore some unimportant variables can rank high even when in practice they do not contribute much to the model.
  • Since the metrics are calculated from the training set, the output gives no indication of how predictive of the target the feature really is (can be high if the model used them to overfit).

Permutation Pros:

  • Excellent remedy to overcome shortcomings of MDI, as the approach forces the model to do without the feature and evaluates its performance.

Permutation Cons:

  • Slightly computationally expensive to calculate.
  • Outputs are clunky and harder to read than other implementations (at least from SKLearn's implementation).

Shapley Pros:

  • Gold standard for interpretable ML.
  • The library is rich in low-code visualisations and implementations that are report-ready and easy to read.

Shapley Cons:

  • Extremely computationally intensive to calculate, especially for larger models/datasets.
  • Plots are not easy to customise and blend with other common libraries e.g. matplotlib/seaborn.

PPS Pros:

  • Very informative and powerful to detect relationships not only between features and target, but also amongst features.
  • Can be used prior to model implementation to uncover patterns that can guide feature engineering and selection.

PPS Cons:

  • Since it's not a metric that determines how important a feature is for a model, it is arguably not a "feature importance" metric in the context of model evaluation.

In the context of this exercise, it is clear that the most important features to determine the market price of the car repeat across evaluation methods, and they relate to:

  • How old the car is: year and odometer play an active role in setting the price of the car - and it is expected that the relationship is not a strong one because the dataset also contains cars that can be classified as 'vintage' which go for much higher prices that cars that are only 10-20 years old.

  • The power of the car: as shown by cylinders.

Plot the tree!¶

Another way to inspect what the model is up to at the global level is to plot the last tree (or some of the trees in case of an ensemble), and explore it. This is helpful to see how the boosted tree (in this case) is making its splits (although the visualisation is friendlier with less complex trees):

In [117]:
from xgboost import plot_tree

last_tree_index = xgb_model.get_booster().num_boosted_rounds() - 1

fig, axes = plt.subplots(1,1,figsize=(20, 100))
plot_tree(xgb_model, num_trees=last_tree_index, rankdir='LR', ax=axes);

Partial Dependence and Individual Conditional Expectation Plots¶

To understand the most important features of the model relate to the target function (price prediction), we have a number of tools at our disposal, and a critical one is combining Partial Dependence Plots (PDPs) with Individual Conditional Expectation (ICE) plots. SKLearn documentation writes that PDPs

show the dependence between the target function and a set of features of interest, marginalizing over the values of all other features (the complement features)

While ICEs

show the dependence between the target function and a feature of interest. However, unlike partial dependence plots which show the average effect of the features of interest, ICE plots visualize the dependence of the prediction on a feature for each sample separately, with one line per sample.

This means that they make sense to be visualised together, at least most of the time. Let us plot the PDPs and ICEs for the four most important features then.

In [123]:
from sklearn.inspection import PartialDependenceDisplay

top_features = ['year', 'odometer', 'cylinders']

top_feature_indices = [np.argwhere(X_test.columns.str.contains(feature))[0][0] for feature in top_features]

fig, axes = plt.subplots(1,1,figsize=(15,5))

PartialDependenceDisplay.from_estimator(
    xgb_model, 
    X_test.sample(n=100, random_state=0), # note that I'm only plotting a smaller sample to make the plots more readable
    top_feature_indices, 
    feature_names=X_test.columns, 
    ax=axes,
    kind='both',  # 'both' to display both PDP and ICE
    ice_lines_kw={'color': 'blue', 'alpha': 0.2},  # ICE plot customization
    pd_line_kw={'color': 'red', 'linestyle': '-', 'linewidth':1, 'label': 'PDP'}  # PDP line customization
);

From this we can visualy confirm how the PDP is, effectively, the ICEs averaged, and we can see that, for example, the model also recognises that car prices tend to decrease with odometer counts, and that the year of the car model affects the price differently pre 90's and past mid 2000's: the pre 90's price increases the older the car is, indicating a "vintage" effect.

Importantly, we can notice that (in this case) while most ICEs tend to follow a very similar trajectory for each feature, not all do, which means that the model finds interactions between features - these could be worth exploring to better understand the model. SHAP's dependence plots (see below) allow us to observe similar effects.

The SKLearn implementation of PDPs also allows us to plot multiple features together, to observe how they behave in relation to the target (car price). In this case (plot on the right in the next figure), we can see how newer (more recent) cars with lower odometers tend to have higher prices (yellow top left corner). We can also see the "vintage pocket" at the bottom left corner (light green), with prices much higher than, say, cars from 1990 to 2010 with very high odometers (middle right, purple). Note that the cmap 'viridis' is the default used, and I have not found a straightforward way to change it. If you do, please reach out!

In [124]:
fig, axes = plt.subplots(1,3,figsize = (15,5))

PartialDependenceDisplay.from_estimator(
    xgb_model, 
    X_test.sample(n=100, random_state=0), # note that I'm only plotting a smaller sample to make the plots more readable
    **{'features': ['odometer', 'year', ('odometer', 'year')], 'kind':'average'},
    ax=axes,
    pd_line_kw={'color': 'red', 'linestyle': '-', 'linewidth':1, 'label': 'PDP'},  # PDP line customization,
);

Local Feature Importance¶

With the exception of ICE plots, the previous approaches can be labelled as "global" because they focus on the importance of the feature at the model level, rather than at the sample (row) level. Shapley values, and the SHAP approach, go beyond the global approach and can be used to observe the importance of the feature at the sample (local) level, offering deeper insights into the model and how to interpret it either to implement it or to refine it further. The summary plot is a natural extension of the bar plot we saw earlier, as it combines feature importance with the distribution of the impacts of those features across the dataset (please note that the default red/blue palette was changed to make these following plots easier to visualise)

In [125]:
shap.summary_plot(shap_values, X, cmap='viridis')

The summary plot shows the features (Y axis), the value of the feature (color), and the positive (right) or negative effect (left) of the value of said feature on the predicted price (X axis). From this, we can read that:

  • Newer (more recent) models tend to push the price of the car up compared to older ones.
  • Higher cylinder counts push the price up, while lower cylinder counts do the opposite.
  • Higher odometer counts push the price down, while lower counts increase it.
  • Title status appears to be more of a confounding variable, since there is no clear indication on how the feature affects price. This feature could be dropped for the next iteration.

And so on. The limitation of this plot is that a passing read might make us infer that the relationship between the Shapley values and the feature values is linear, when in fact it might not be. This can be remediated by observing the dependence plots, where the actual relationship between Shapley values and feature values is portrayed, with the added bonus that another feature is chosen to show interaction (the selection of this feature is automatic, but can be set manually as well):

In [127]:
selected_features = ['year', 'odometer', 'cylinders']

for feature in selected_features:
    shap.dependence_plot(feature, shap_values, X, cmap='viridis', alpha=.5)

This allows us to see, for example, that in our model the year of the car (year) plays a much more important role in predicting the price of the car for models from around year 2000 onward (which could also be the consequence of the original dataframe being heavily skewed towards newer cars) - therefore the relationship is not linear at all. Also, we can see that when it comes to cylinders there seems to be a more linear relationship between the Shap values for cylinders and the cars with 4 to 10 cylinders.

Since Shap also allows us to visualise sample-level explanations for each feature, it offers a suite of options to do this. The force plot is a very interesting tool to visualise the "force" of different featues on a given prediction. We then consider three cars from the dataset and observe their respective force plots. Note how, depending on the sample at hand, the features can influence the predicted price (highlighted number) by driving it up (red) or down (blue). For the following plots I picked two random cars, and added the index of the outlier we singled out earlier:

In [128]:
car_indices = [2,8,outlier_index]
cars_df.iloc[car_indices]
Out[128]:
price year manufacturer model condition cylinders fuel odometer title_status transmission drive type state model_count
2 8995 2008.0 lincoln navigator 0.0 8.0 2.0 147887.0 5 automatic 4wd SUV mi 236
8 16300 2016.0 chevrolet traverse 0.0 6.0 2.0 112657.0 5 automatic 4wd SUV mi 801
3426 5999 2017.0 chevrolet silverado 4.0 8.0 2.0 99999.0 5 automatic 4wd pickup fl 2011
In [129]:
shap.force_plot(explainer.expected_value, shap_values[car_indices[0], :], X.iloc[car_indices[0], :])
Out[129]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [130]:
shap.force_plot(explainer.expected_value, shap_values[car_indices[1], :], X.iloc[car_indices[1], :])
Out[130]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [131]:
shap.force_plot(explainer.expected_value, shap_values[car_indices[2], :], X.iloc[car_indices[2], :])
Out[131]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

Similar to the force plot is the waterfall plot, that shows (with the same logic and colours as before) how the features affect the final prediction. The main difference with the force plot is that it shows you how it navigates from the naive expected prediction E[f(X)] at the bottom of the plot to the actual prediction at the top of it.

In [132]:
for i in car_indices:
    print(pd.DataFrame(cars_df.iloc[i]).T)
    shap.plots.waterfall(explainer_obj[i], max_display=8)
  price    year manufacturer      model condition cylinders fuel  odometer  \
2  8995  2008.0      lincoln  navigator       0.0       8.0  2.0  147887.0   

  title_status transmission drive type state model_count  
2            5    automatic   4wd  SUV    mi         236  
   price    year manufacturer     model condition cylinders fuel  odometer  \
8  16300  2016.0    chevrolet  traverse       0.0       6.0  2.0  112657.0   

  title_status transmission drive type state model_count  
8            5    automatic   4wd  SUV    mi         801  
     price    year manufacturer      model condition cylinders fuel odometer  \
3426  5999  2017.0    chevrolet  silverado       4.0       8.0  2.0  99999.0   

     title_status transmission drive    type state model_count  
3426            5    automatic   4wd  pickup    fl        2011  

One evident limitation of the waterfall and force plots is that it they are one-sample-at-a-time plots. If we want to see the effect of the features on multiple samples overlaid but with the same kind of interpretability (i.e. not like in the summary plot where you can't identify samples), we can turn to the decision plot. The decision plot can help visualise what the waterfall plot shows (from naive to predicted value), but with more samples being plotted, sharing the same output range for the predictions (top coloured bar), and feature order (y axis). As the name implies, it shows how the model decides how to go from naive prediction to actual prediction multiple samples (see first plot in the next section).

These sort of plots and visualisations are useful to detect outliers, misclassifications, identify common prediction paths for a good number of samples (which can help zero in on why the model made a mistake on a given sample), and more. Now we can use this approach to better understand what drove the model to create such a huge outlier earlier. Let's compare the outlier, a Chevy Silverado, with other random Silverados in the dataset.

In [134]:
outlier_model = 'silverado'
silverados_sample = pd.concat([
    cars_df[cars_df['model']==outlier_model].sample(n=10, random_state=0), # Ten random Silverados
    pd.DataFrame(cars_df.iloc[outlier_index,:]).T # our outlier, at the bottom of the df
    ])
silverados_sample
Out[134]:
price year manufacturer model condition cylinders fuel odometer title_status transmission drive type state model_count
16500 11000 2007.0 chevrolet silverado 2.0 8.0 2.0 205000.0 5 automatic 4wd pickup il 2011
5828 13998 2011.0 chevrolet silverado 0.0 8.0 2.0 150853.0 5 automatic 4wd pickup id 2011
13157 5300 1998.0 chevrolet silverado 0.0 8.0 2.0 190000.0 5 automatic 4wd pickup ma 2011
6136 7000 2002.0 chevrolet silverado 0.0 8.0 2.0 125000.0 5 automatic 4wd pickup pa 2011
10341 35999 2017.0 chevrolet silverado 0.0 8.0 2.0 47815.0 5 automatic 4wd pickup tx 2011
5594 8500 1994.0 chevrolet silverado 3.0 8.0 2.0 142262.0 5 automatic 4wd pickup tx 2011
939 31956 2017.0 chevrolet silverado 3.0 8.0 2.0 91059.0 5 automatic 4wd pickup ky 2011
528 24995 2017.0 chevrolet silverado 0.0 8.0 2.0 64250.0 5 automatic 4wd pickup dc 2011
8834 6111 2007.0 chevrolet silverado 0.0 8.0 2.0 234336.0 5 automatic 4wd pickup fl 2011
14343 26500 1984.0 chevrolet silverado 0.0 8.0 2.0 38751.0 5 automatic 4wd pickup sd 2011
3426 5999 2017.0 chevrolet silverado 4.0 8.0 2.0 99999.0 5 automatic 4wd pickup fl 2011

And we plot them all together on a decision plot (highlighting the outlier with a dashed line)

In [135]:
car_indices2 = silverados_sample.index.tolist()

r = shap.decision_plot(explainer.expected_value, shap_values[car_indices2], X.iloc[car_indices2], highlight=car_indices2.index(outlier_index), return_objects=True)

And now we can plot the outlier on its own to reveal the features and values that drive the model to make the prediction it's making.

Note that to preserve the order of features and scales as in the "collective" decision plot above, you need to assign the collective plot to a variable (in this case r) and add the argument return_objects=True so that you can call it later as I did below.

In [136]:
selected_sample = outlier_index
shap.decision_plot(
    explainer.expected_value,
    shap_values[selected_sample],
    X.iloc[selected_sample],
    feature_order=r.feature_idx,
    xlim=r.xlim,
)

The model therefore arrives at the conclusion that the price for this Chevy Silverado is over 30 grand, mainly because it's a new model with comparable odometer to others from around the same year, and it's an 8 cylinder vehicle, and therefore should have a price tag similar to the other newer Silverados (see the comparison above). But the real sale price for this one is much lower!

In [137]:
pd.DataFrame(cars_df.iloc[outlier_index,:]).T
Out[137]:
price year manufacturer model condition cylinders fuel odometer title_status transmission drive type state model_count
3426 5999 2017.0 chevrolet silverado 4.0 8.0 2.0 99999.0 5 automatic 4wd pickup fl 2011

So it's not that the model is making a mistake - rather, this data point is an anomaly that is creating noise for the model, and the model is correctly inferring that, based on all the other conditions of the car, the price should be much higher, similar to the other newer Silverados! We should therefore just drop this data point for future iterations of the model on this dataset.

Observing the effect of multiple features over a large number of predictions can get very messy on a decision plot, and on any of the plots we have covered so far. Here is where heatmaps come to the rescue (within reason), as they allow for the quick visualisation of the impact of the model features across hundreds of samples. Shap heatmaps can also be sorted so that the output variable f(x) -- in this case the predicted car price -- is in decreasing fashion, which allows to see whether there are patterns or pockets of feature values heavily influencing the output. In this case, we can see that odometer and year affect the model importantly across multiple instances, with high SHAP values pushing the predicted price f(x) higher. We can also see that in most instances fuel and title_status do not play a role in determining the outcome of the model, but there are a handful of instance where they do (and could be worth exploring).

An important limitation of the heatmap however is that the more samples one wants to plot, the thinner the bands get so if the features do not show clear patterns then the heatmap becomes very hard to read, but nonetheless it remains an excellent diagnostic tool for, say, comparing two subpopulations within the model, like a couple hundred of low error predictions versus a couple hundred outliers.

In [138]:
sample_idx = X.sample(n=1000, random_state=0).index.tolist()

shap.plots.heatmap(
    explainer(X.loc[sample_idx]), 
    instance_order=explainer(X.loc[sample_idx]).sum(1)
    );

There are some other Shap plots that are included as part of the Shap library, but they either are not applicable to the current project (e.g. monitoring plot or image plot), or provide information that is already covered in plots above (e.g. the violin plot is, for the most part, captured by the summary plot, and the latter is richer in my opinion). Nonetheless the plots we covered here provide a wealth of tools to diagnose, explain and further refine a model.

Conclusion and Future Directions¶

The purpose of this exercise was to highlight and explore some of the tools we currently have at our disposal to explain the inner workings of a complex model to non-technical stakeholders, and to ourselves as well, since it is most likely that we will have to go back to refine and fine-tune the current model to minimise the errors and the likelihood of making catastrophic errors.

From inspection of the results so far there are many hypothesis worth exploring, for example - and considering how the year of the car shows large influence over the model for cars from 2000 onward -, maybe two models could be developed and evaluated: one for newer cars and one for older/vintage ones. Additionally it would be interesting to explore what features the model leverages in cars whose price is correctly predicted that the erroneously evaluated cars do not have, or how are these different between correctly and incorrectly assessed cars.

As mentioned above, the purpose of this project is not to cover all the explainability tools we have at our disposal, but to explore the most common ones that can help us gain insights for future model improvement, for communicating results to non-technical stakeholders, and to demystify the inner workings of machine learning models. Libraries such as Shap provide beautiful out-of-the-box visualisations for explainability purposes with great functionality, although if you are like me and prefer customising the plots with predefined colors or put them in subplots, you will have a hard time (although a good start can be found here). The computational price to pay also increases dramatically with larger datasets such as the one I used here (which forced me to use samples when appropriate), but the price is worth paying considering the objects only need to be calculated once and then saved/loaded from memory.

As machine learning gets more complex and AI takes over (the world), explainability will slowly move from being a nice-to-have to outright mandatory, especially the more businesses adopt these technologies. Maybe by that time Shap will have better compatibility with Matplotlib!

Thanks for reading!