This project uses data from the Association of Tennis Professionals (ATP) men's league from 2009 to 2017 to build a multiple linear regression model and understand which features are linearly correlated with success as a male tennis player. Rather than offering rare and valuable insight into tennis, the purpose of this project is to build and evaluate a multiple linear regression model for pedagogic and display purposes only.
The data represents the top 1500 players within the aforementioned period. Credit for the data goes to Codecademy. Beyond ID data, the dataset includes 'offensive' (service game) statistics, 'defensive' (return game) statistics, and outcomes. A description of each column in the dataset was provided:
-Player: name of the tennis player.
-Year: year the data was recorded.
-Aces: number of serves by the player where the receiver does not touch the ball.
-DoubleFaults: number of times the player missed both first and second serve attempts.
-FirstServe: % of first-serve attempts made.
-FirstServePointsWon: % of first-serve attempt points won by the player.
-SecondServePointsWon: % of second-serve attempt points won by the player.
-BreakPointsFaced: number of times where the receiver could have won the service game of the player.
-BreakPointsSaved: % of the time the player was able to stop the receiver from winning the service game when they had the chance.
-ServiceGamesPlayed: total number of games where the player served.
-ServiceGamesWon: total number of games where the player served and won.
-TotalServicePointsWon: % of points won in games where the player served.
-FirstServeReturnPointsWon: % of opponents' first-serve points the player was able to win.
-SecondServeReturnPointsWon: % of opponents' second-serve points the player was able to win.
-BreakPointsOpportunities: number of times where the player could have won the service game of the opponent.
-BreakPointsConverted: % of the time the player was able to win their opponent's service game when they had the chance.
-ReturnGamesPlayed: total number of games where the player's opponent served.
-ReturnGamesWon: total number of games where the player's opponent served and the player won.
-ReturnPointsWon: total number of points where the player's opponent served and the player won.
-TotalPointsWon: % of points won by the player.
-Wins: number of matches won in a year.
-Losses: number of matches lost in a year.
-Winnings: total winnings in USD ($) in a year.
-Ranking: ranking at the end of the year.
This pipeline uses seaborn, pandas, matplotlib, statsmodels and sklearn. If there is something you think could be improved, or done differently, feel free to reach out!
We'll start by importing all the required libraries, loading the dataset, and performing an initial exploration of the data to check for missing values, issues with data types, etc.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
tennis_data = pd.read_csv('tennis_stats.csv')
tennis_data.head()
Player | Year | FirstServe | FirstServePointsWon | FirstServeReturnPointsWon | SecondServePointsWon | SecondServeReturnPointsWon | Aces | BreakPointsConverted | BreakPointsFaced | ... | ReturnGamesWon | ReturnPointsWon | ServiceGamesPlayed | ServiceGamesWon | TotalPointsWon | TotalServicePointsWon | Wins | Losses | Winnings | Ranking | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Pedro Sousa | 2016 | 0.88 | 0.50 | 0.38 | 0.50 | 0.39 | 0 | 0.14 | 7 | ... | 0.11 | 0.38 | 8 | 0.50 | 0.43 | 0.50 | 1 | 2 | 39820 | 119 |
1 | Roman Safiullin | 2017 | 0.84 | 0.62 | 0.26 | 0.33 | 0.07 | 7 | 0.00 | 7 | ... | 0.00 | 0.20 | 9 | 0.67 | 0.41 | 0.57 | 0 | 1 | 17334 | 381 |
2 | Pedro Sousa | 2017 | 0.83 | 0.60 | 0.28 | 0.53 | 0.44 | 2 | 0.38 | 10 | ... | 0.16 | 0.34 | 17 | 0.65 | 0.45 | 0.59 | 4 | 1 | 109827 | 119 |
3 | Rogerio Dutra Silva | 2010 | 0.83 | 0.64 | 0.34 | 0.59 | 0.33 | 2 | 0.33 | 5 | ... | 0.14 | 0.34 | 15 | 0.80 | 0.49 | 0.63 | 0 | 0 | 9761 | 125 |
4 | Daniel Gimeno-Traver | 2017 | 0.81 | 0.54 | 0.00 | 0.33 | 0.33 | 1 | 0.00 | 2 | ... | 0.00 | 0.20 | 2 | 0.50 | 0.35 | 0.50 | 0 | 1 | 32879 | 272 |
5 rows × 24 columns
tennis_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1721 entries, 0 to 1720
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   Player                      1721 non-null   object
 1   Year                        1721 non-null   int64
 2   FirstServe                  1721 non-null   float64
 3   FirstServePointsWon         1721 non-null   float64
 4   FirstServeReturnPointsWon   1721 non-null   float64
 5   SecondServePointsWon        1721 non-null   float64
 6   SecondServeReturnPointsWon  1721 non-null   float64
 7   Aces                        1721 non-null   int64
 8   BreakPointsConverted        1721 non-null   float64
 9   BreakPointsFaced            1721 non-null   int64
 10  BreakPointsOpportunities    1721 non-null   int64
 11  BreakPointsSaved            1721 non-null   float64
 12  DoubleFaults                1721 non-null   int64
 13  ReturnGamesPlayed           1721 non-null   int64
 14  ReturnGamesWon              1721 non-null   float64
 15  ReturnPointsWon             1721 non-null   float64
 16  ServiceGamesPlayed          1721 non-null   int64
 17  ServiceGamesWon             1721 non-null   float64
 18  TotalPointsWon              1721 non-null   float64
 19  TotalServicePointsWon       1721 non-null   float64
 20  Wins                        1721 non-null   int64
 21  Losses                      1721 non-null   int64
 22  Winnings                    1721 non-null   int64
 23  Ranking                     1721 non-null   int64
dtypes: float64(12), int64(11), object(1)
memory usage: 322.8+ KB
tennis_data.isna().sum()
Player                        0
Year                          0
FirstServe                    0
FirstServePointsWon           0
FirstServeReturnPointsWon     0
SecondServePointsWon          0
SecondServeReturnPointsWon    0
Aces                          0
BreakPointsConverted          0
BreakPointsFaced              0
BreakPointsOpportunities      0
BreakPointsSaved              0
DoubleFaults                  0
ReturnGamesPlayed             0
ReturnGamesWon                0
ReturnPointsWon               0
ServiceGamesPlayed            0
ServiceGamesWon               0
TotalPointsWon                0
TotalServicePointsWon         0
Wins                          0
Losses                        0
Winnings                      0
Ranking                       0
dtype: int64
tennis_data['Player'].value_counts()
Ivan Dodig         9
Leonardo Mayer     9
Dudi Sela          9
Evgeny Donskoy     9
Go Soeda           9
                  ..
Gleb Sakharov      1
Vaclav Safranek    1
Alex de Minaur     1
Takao Suzuki       1
Kento Takeuchi     1
Name: Player, Length: 438, dtype: int64
tennis_data.describe()
Year | FirstServe | FirstServePointsWon | FirstServeReturnPointsWon | SecondServePointsWon | SecondServeReturnPointsWon | Aces | BreakPointsConverted | BreakPointsFaced | BreakPointsOpportunities | ... | ReturnGamesWon | ReturnPointsWon | ServiceGamesPlayed | ServiceGamesWon | TotalPointsWon | TotalServicePointsWon | Wins | Losses | Winnings | Ranking | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | ... | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | 1721.000000 | 1.721000e+03 | 1721.000000 |
mean | 2013.646717 | 0.598053 | 0.680738 | 0.261673 | 0.479733 | 0.466432 | 97.105171 | 0.369407 | 112.003486 | 102.918071 | ... | 0.173823 | 0.342208 | 197.650203 | 0.715590 | 0.473155 | 0.599245 | 7.876816 | 9.278908 | 2.344928e+05 | 269.610691 |
std | 2.488018 | 0.054533 | 0.070422 | 0.056639 | 0.066902 | 0.068447 | 137.966077 | 0.162987 | 119.247651 | 122.761670 | ... | 0.080880 | 0.049369 | 221.208703 | 0.123287 | 0.037139 | 0.057718 | 10.183716 | 8.996450 | 2.530537e+05 | 277.341947 |
min | 2009.000000 | 0.360000 | 0.270000 | 0.000000 | 0.060000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.220000 | 0.250000 | 0.000000 | 0.000000 | 1.080000e+02 | 3.000000 |
25% | 2012.000000 | 0.570000 | 0.650000 | 0.240000 | 0.460000 | 0.440000 | 7.000000 | 0.320000 | 15.000000 | 9.000000 | ... | 0.130000 | 0.320000 | 22.000000 | 0.670000 | 0.460000 | 0.570000 | 0.000000 | 2.000000 | 4.931100e+04 | 83.000000 |
50% | 2014.000000 | 0.600000 | 0.690000 | 0.270000 | 0.490000 | 0.480000 | 34.000000 | 0.380000 | 55.000000 | 41.000000 | ... | 0.180000 | 0.350000 | 86.000000 | 0.750000 | 0.480000 | 0.610000 | 3.000000 | 5.000000 | 1.252120e+05 | 166.000000 |
75% | 2016.000000 | 0.630000 | 0.720000 | 0.290000 | 0.520000 | 0.500000 | 140.000000 | 0.430000 | 201.000000 | 172.000000 | ... | 0.220000 | 0.370000 | 348.000000 | 0.790000 | 0.500000 | 0.630000 | 13.000000 | 17.000000 | 3.500750e+05 | 333.000000 |
max | 2017.000000 | 0.880000 | 0.890000 | 0.480000 | 0.920000 | 0.750000 | 1185.000000 | 1.000000 | 507.000000 | 573.000000 | ... | 0.560000 | 0.510000 | 916.000000 | 1.000000 | 0.820000 | 0.820000 | 48.000000 | 36.000000 | 1.074562e+06 | 1443.000000 |
8 rows × 23 columns
It looks like the dataset has no missing entries and that the dtypes correspond to what we expected. Additionally, we have a good number of players represented and a good range of rankings. However, there are many players who are 'repeats' in the data, since they frequently feature in the highest rankings. Since our linear regression model relies on independent observations, we can't work with players who have more than one entry. So, before doing anything else, we'll drop the duplicates based on the Player column.
tennis_data = tennis_data.drop_duplicates(subset='Player')
The first thing we'll now do is perform exploratory analysis via plotting. Considering that we're interested in "success" for a tennis player, it would be useful to see which features (variables) correlate strongly with outcomes. There are four outcome columns in the dataset: Wins, Losses, Winnings and Ranking. Since Winnings and Ranking are basically a function of Wins and Losses (at least out of sheer logic), we'll focus on the latter two to see what predicts (i.e. correlates with) these outcomes. We'll thus explore which variables from the "Offensive" and "Defensive" categories correlate strongly with the outcomes. We'll also plot the distribution of the independent variables while we're at it, and check the outlier situation for each variable.
Let's start with the "Offensive" variables.
offensive_vars = tennis_data.columns[[7,12,2,3,5,9,11,16,17,19]]
fig, axes = plt.subplots(len(offensive_vars),4, figsize=(25, 60))
fig.subplots_adjust(hspace=0.3, wspace=0.2)
sns.set(font_scale=1.3)
for i, var in enumerate(offensive_vars):
    # Scatterplots of the variable against the two outcomes of interest
    sns.scatterplot(x=var, y='Wins', data=tennis_data, ax=axes[i,0], alpha=0.4).set_title(f'{var} vs Wins', fontsize=18, weight='bold')
    sns.scatterplot(x=var, y='Losses', data=tennis_data, ax=axes[i,1], alpha=0.4).set_title(f'{var} vs Losses', fontsize=18, weight='bold')
    # Distribution and outlier check of the variable itself
    sns.histplot(x=var, data=tennis_data, ax=axes[i,2]).set_title('Distribution', fontsize=18, weight='bold')
    sns.boxplot(x=var, data=tennis_data, ax=axes[i,3]).set_title('Outliers', fontsize=18, weight='bold')
We'll do the same for the "Defensive" variables now.
defensive_vars = tennis_data.columns[[4,6,10,8,13,14,15,18]]
fig, axes = plt.subplots(len(defensive_vars),4, figsize=(25, 50))
fig.subplots_adjust(hspace=0.3, wspace=0.2)
sns.set(font_scale=1.3)
for i, var in enumerate(defensive_vars):
    # Scatterplots of the variable against the two outcomes of interest
    sns.scatterplot(x=var, y='Wins', data=tennis_data, ax=axes[i,0], alpha=0.4).set_title(f'{var} vs Wins', fontsize=18, weight='bold')
    sns.scatterplot(x=var, y='Losses', data=tennis_data, ax=axes[i,1], alpha=0.4).set_title(f'{var} vs Losses', fontsize=18, weight='bold')
    # Distribution and outlier check of the variable itself
    sns.histplot(x=var, data=tennis_data, ax=axes[i,2]).set_title('Distribution', fontsize=18, weight='bold')
    sns.boxplot(x=var, data=tennis_data, ax=axes[i,3]).set_title('Outliers', fontsize=18, weight='bold')
This gives us a nice overview of the shape of the variables and their relationship to our two most important outcomes: Wins and Losses.
From a quick glance at these figures, we can see that there are four independent variables that show a good linear correlation with both outcomes of interest: two "offensive" variables (BreakPointsFaced and ServiceGamesPlayed), and two "defensive" variables (BreakPointsOpportunities and ReturnGamesPlayed), with the BreakPointsFaced and Wins pair showing the weakest association of the lot. They also have very few outliers.
But we have a big problem: these variables correlate positively with both Wins and Losses, meaning they are of little use in understanding what we (if we were tennis players) should focus on to maximise our wins while avoiding losses.
Additionally, if we revisit the descriptions of these variables:
-BreakPointsFaced: number of times where the receiver could have won the service game of the player.
-ServiceGamesPlayed: total number of games where the player served.
-BreakPointsOpportunities: number of times where the player could have won the service game of the opponent.
-ReturnGamesPlayed: total number of games where the player's opponent served.
It appears as if the relationship between these variables and the chosen outcomes boils down to "the more you play, the more you can either win or lose" (which is basically useless insight), and that we might have some collinearity between these features when we evaluate the model (we'll check this later).
Another option is to explore the relationship of the same variables with the other outcomes that we sidelined (Winnings and Ranking). We'll do this now so we can keep exploring and choose the best features to build our model with, hopefully gaining some interesting insights along the way. This time, however, we'll visualise the relationships between all variables using a correlation matrix displayed as a heatmap.
data = tennis_data.drop(['Player', 'Year'], axis=1).corr()
labels = round(data,2)
sns.set(font_scale=0.85, rc={'figure.figsize': (12,12)})
sns.heatmap(data, cmap='vlag', center=0, annot=labels).set_title('Tennis Data Linear Relationships Heatmap', weight='bold', fontsize=18)
plt.show()
The .corr() method calculates the correlation between the variables (using Pearson's approach by default) and outputs a correlation matrix that contains Pearson's coefficients for each pair of variables in the dataset. Visualising this correlation matrix as a heatmap is a convenient way to quickly check linear relationships between variables when the dataset contains too many variables and a pairplot would give you an output that's too large and cumbersome to read. Additionally, using this method allows us to see the numerical values of said relationships (so we don't have to guess how strong a relationship is). The downside of this approach is that it will not show you other kinds of relationships that scatterplots might make evident, so be careful!
In the case of this dataset, there appears to be no variable that correlates uniquely with either Wins, Losses, Winnings or Ranking. In fact, the variables that correlate most strongly with Wins also correlate just as strongly with Losses and Winnings, unfortunately. From this perspective, it appears that we will not be getting much insight into what predicts either outcome, and this dataset might simply reflect non-linear relationships in the data at this level of tennis performance.
Nonetheless, we will build and evaluate some linear models to understand the underlying assumptions of the models and common pitfalls to avoid.
Linear regression, although simple to understand and powerful in many cases, relies on several underlying assumptions that need to be met if the model is going to be an accurate representation of the data, and if it's going to give us appropriate predictive power. These assumptions are:
-Independence of observations.
-No hidden or missing variables.
-Linear relationship.
-Normality of residuals.
-No or little multicollinearity.
-Homoscedasticity.
-Independent variables are uncorrelated with the error term.
-Observations of error term are uncorrelated with each other.
Let's explore them one by one and use the dataset at hand to check if the model we will build passes the test.
Independence of observations: Simply put, this means that one observation is not related to another in some obvious manner (e.g. entries from the same patient in the case of medical data, or the same vendor in the case of sales data). In the case of our dataset, the same player can appear in more than one entry because they featured in the top rankings in more than one year. We dealt with this issue earlier when we used the .drop_duplicates() method with the Player column as reference. If the data had no ID column (or was anonymised for ethical purposes), we would need to check the methodology under which the data was collected to see if this assumption has been met for the dataset at hand.
No hidden or missing variables: Basically, that you have (hopefully) collected all the relevant variables that predict an outcome; otherwise the model will simply not be a good reflection of what predicts the outcome. It is hard to check for this assumption, since it's effectively a question of methodology design rather than of data evaluation, but if you end up with the wrong model -- or at the very least an incomplete one -- even after doing everything right, this is a plausible explanation and a consideration for future research approaches.
Linear relationship: This seems obvious, but it is important to check that the independent variable(s) and the outcome share a linear relationship. If you can see by eye in a scatterplot that the relation is not linear (e.g. it is exponential, polynomial, etc.) and you still apply a linear regression model to the data, you are simply applying the wrong model. This is why the initial exploratory analysis is critical to see if a linear model might be appropriate (or at least not totally misguided). Have a look again at the scatterplots and the heatmap to check which variables share (or do not share) a linear relationship with the outcome variables. As a rule of thumb, a Pearson's coefficient above 0.7 (for positive linear relationships) or below -0.7 (for negative linear relationships) can be considered a strong linear relationship.
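As a quick way of applying this rule of thumb to our data (a minimal sketch using the tennis_data frame loaded above and Winnings as the outcome of interest), we can list the variables whose Pearson coefficient against Winnings exceeds 0.7 in absolute value:
# Pearson correlation of every numeric column with Winnings
winnings_corr = tennis_data.drop(['Player', 'Year'], axis=1).corr()['Winnings'].drop('Winnings')
# Keep only the 'strong' linear relationships per the +/-0.7 rule of thumb
print(winnings_corr[winnings_corr.abs() > 0.7].sort_values(ascending=False))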
To check the next set of assumptions we need to actually build a model and predict some values using the model and a validation dataset. By convention, we will use X as the collection of independent variables and y as the outcome (dependent/response) variable.
IMPORTANT NOTE: The initial model will contain all the variables that have a high linear correlation with the outcome Winnings, and will thus violate many assumptions for the sake of the exercise, which we will correct later.
X = tennis_data[['Aces', 'BreakPointsFaced', 'BreakPointsOpportunities', 'DoubleFaults', 'ReturnGamesPlayed','ServiceGamesPlayed']]
y = tennis_data['Winnings']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state=100)
lm = LinearRegression()
lm.fit(X_train, y_train);
Now let's get back to checking our assumptions.
Normality of residuals: Residuals are the difference between the outcome values predicted by the model for the validation data and the actual outcome values in the validation data. If the residuals are not normally distributed, it means that the model is not appropriate for the data at hand (e.g. a non-linear relationship, multiple linear relationships rather than a single one, etc.). For this, we first need to calculate the residuals:
y_pred = lm.predict(X_test) #Predicted outcome values using the model and the test data
residuals = y_test - y_pred
And now we plot them:
fig = plt.figure(figsize=(5,5))
plt.hist(residuals, bins=20)
plt.title('Distribution of Residuals', fontsize=12, weight='bold')
plt.ylabel('Frequency', fontsize=10)
plt.show()
The distribution doesn't look close to normal: it is heavily centered and it has extra weight on the tails. This might indicate that there is a problem with the model. Another way to look at the residuals is what is known as a QQ-plot, which graphs the relationship between the quantiles in the sample data against the theoretical quantiles that the data would have if it followed a perfectly normal distribution.
import statsmodels.api as sm
with plt.rc_context():
    plt.rc("figure", figsize=(5,5))
    fig = sm.qqplot(residuals, line='r') #by default compares to a normal distribution, but this can be changed
    plt.title('QQ-Plot of Residuals', fontsize=12, weight='bold')
    plt.show()
A QQ-plot of the residuals of a model that follows a normal distribution would show the points fitting the line to a good degree. This is not the case here, further confirming that the model has some issues, as the residuals are not normally distributed. The "weight at the tails" we noted in the previous histogram also shows up here as outliers at both ends of the line. This calls for a review of the model, but we will do that after checking the remaining assumptions.
No or little multicollinearity: Collinearity is when two independent variables are highly linearly correlated and both are used to predict the outcome of choice. Multicollinearity is when a variable is highly correlated with two or more other independent variables. This is important to check for a number of reasons:
More variables predicting an outcome is not always good if they are related to each other, as it means the explanatory power of one variable cannot be disentangled from the other variable. Put differently, the model would not know which variable is more important than the other when predicting the outcome, rendering one of the variables redundant. By way of example, if we wanted to predict heart disease status in people and had the variables "height", "weight" and "BMI" in our dataset, it is highly likely that these variables will present (multi)collinearity, since they are related (and BMI is effectively a function of height and weight).
It introduces noise in the model, as each variable comes with its own error when fitting, so the more variables we have, the higher the total error of the model, which is especially unnecessary when one of the variables could be dropped without losing predictive power.
A good way to check for (multi)collinearity is the Variance Inflation Factor (VIF). The VIF shows "the strength of the linear relationship between the variable and the remaining explanatory variables", i.e. how much the regression coefficient of the variable at hand is 'inflated' due to the collinearity. This also means that the relationship between the variable and the outcome is affected by the presence of the other variables, which makes the predictions less accurate. It is an error we need to minimise!
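For reference, the VIF of a given predictor is obtained by regressing that predictor on all the remaining predictors and plugging the resulting coefficient of determination into:
$$VIF_i = \frac{1}{1 - R_i^2}$$
Where R_i^2 is the R-squared of regressing predictor i on all the other predictors. A predictor that is completely uncorrelated with the rest gives a VIF of 1; the stronger its linear relationship with the other predictors, the larger its VIF.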
The VIF can be calculated using statsmodels:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame({'Feature': X_test.columns, 'VIF': [variance_inflation_factor(X_test.values, i) for i in range(len(X_test.columns))]})
vif_data
Feature | VIF | |
---|---|---|
0 | Aces | 16.079797 |
1 | BreakPointsFaced | 76.148170 |
2 | BreakPointsOpportunities | 43.635450 |
3 | DoubleFaults | 16.608583 |
4 | ReturnGamesPlayed | 7380.594760 |
5 | ServiceGamesPlayed | 7620.490832 |
This shows that there is (impressive) multicollinearity in our model, understandably: since playing a Return Game means by default that you will play a Service Game and vice versa, these two variables are linked and are interdependent. We will remove them in our next iterations. Similar story with the variables related to Break Points.
Ideally, our model should show VIFs close to 1 for all variables (1 is the lowest possible value, and the closer to it the better); values below 5 are generally considered acceptable. From 5 to 10 is undesirable, and over 10 is unacceptable and would mean that there is something fundamentally wrong with the model.
Homoscedasticity: This means that the variance of the error (the difference between the predicted outcome values and the actual outcome values) is constant across the range of predictions. This can be quickly checked with a scatterplot of the residuals against the outcome variable. We should expect the points to show constant variation with respect to the zero line.
fig = plt.figure(figsize=(5,5))
plt.scatter(y_test, residuals, alpha=0.4, s=15)
plt.plot(y_test, [0]*len(y_test), c='r')
plt.ylabel('Error size')
plt.xlabel('Outcome variable (Winnings, in millions)')
plt.title('Variance in Error of Residuals', fontsize=12, weight='bold')
plt.show()
In this case, the pattern is clearly not homoscedastic (i.e. it is heteroscedastic).
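As a numerical complement to the visual check (a small sketch using the residuals and X_test defined above), statsmodels offers the Breusch-Pagan test, whose null hypothesis is homoscedasticity; a very small p-value would support what we see in the plot:
from statsmodels.stats.diagnostic import het_breuschpagan
# Breusch-Pagan test: regresses the squared residuals on the predictors (plus a constant)
bp_lm, bp_lm_pvalue, bp_f, bp_f_pvalue = het_breuschpagan(residuals, sm.add_constant(X_test))
print('Breusch-Pagan LM p-value:', bp_lm_pvalue)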
Independent variables are uncorrelated with the error term: Basically, that there are no glaringly obvious patterns and relationships between the independent variables and the residuals.
fig, axes = plt.subplots(2,3, figsize=(15, 10))
fig.subplots_adjust(hspace=0.3, wspace=0.22)
sns.set(font_scale=0.8)
subplts = [axes[0,0], axes[0,1], axes[0,2], axes[1,0], axes[1,1], axes[1,2]]
colnames = X_test.columns
for i in range(len(colnames)):
    sns.scatterplot(x=X_test[colnames[i]], y=residuals, ax=subplts[i], alpha=0.4).set_title(f'{colnames[i]} vs \nResiduals (Winnings)', fontsize=12, weight='bold')
    sns.lineplot(x=X_test[colnames[i]], y=[0]*len(X_test[colnames[i]]), ax=subplts[i], color='r')
    subplts[i].set_yticks([-300000, -200000, -100000, 0, 100000, 200000, 300000],['-300K', '-200K', '-100K', '0', '100K', '200K', '300K'])
    subplts[i].set_ylabel('Residuals (Winnings)')
Here we do not see any correlation between the variables and the residuals, but the fact that the same pattern repeats between all variables and the residuals is probably another indication of the collinearity we observed numerically with the VIF.
Observations of error term are uncorrelated with each other: Basically, that there are no patterns in the residuals themselves (similar to the check for normality of residuals). This means that one observation of the error term (residual) should not predict the next one. We want to observe randomness instead of patterns, cycles, and so on.
fig = plt.figure(figsize=(12,5))
sns.lineplot(x=residuals.index, y=residuals, linewidth=1)
plt.xlabel('Index of Residual')
plt.ylabel('Residuals (Winnings)')
plt.yticks([-300000, -200000, -100000, 0, 100000, 200000, 300000],['-300K', '-200K', '-100K', '0', '100K', '200K', '300K'])
plt.title('Autocorrelation in Residuals', fontsize=12, weight='bold')
plt.show()
This assumption is also met.
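If we wanted a numerical complement to the visual check above (a small sketch using the residuals computed earlier), the Durbin-Watson statistic quantifies first-order autocorrelation in the residuals; values close to 2 indicate little autocorrelation. statsmodels also reports this statistic in its OLS summary, which we will use later.
from statsmodels.stats.stattools import durbin_watson
print('Durbin-Watson statistic:', durbin_watson(residuals))  # values near 2 suggest uncorrelated residuals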
Now that we know where the model is failing, we will go back and correct these problems one by one, starting with the most glaring error: the levels of multicollinearity. Once we fix this, we will re-evaluate whether the new model complies with our assumptions. We will re-create the model, omitting the variables that had the highest VIF values: ReturnGamesPlayed and ServiceGamesPlayed.
X = tennis_data[['Aces', 'BreakPointsFaced', 'BreakPointsOpportunities', 'DoubleFaults']]
y = tennis_data['Winnings']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state=100)
lm.fit(X_train, y_train);
vif_data = pd.DataFrame({'Feature': X_test.columns, 'VIF': [variance_inflation_factor(X_test.values, i) for i in range(len(X_test.columns))]})
vif_data
Feature | VIF | |
---|---|---|
0 | Aces | 4.030206 |
1 | BreakPointsFaced | 27.177893 |
2 | BreakPointsOpportunities | 19.802520 |
3 | DoubleFaults | 14.101704 |
That improved the VIF values, but we still have variables with high VIFs. Let's remove the next set of evidently collinear variables: BreakPointsFaced and BreakPointsOpportunities.
X = tennis_data[['Aces', 'DoubleFaults']]
y = tennis_data['Winnings']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state=100)
lm.fit(X_train, y_train);
vif_data = pd.DataFrame({'Feature': X_train.columns, 'VIF': [variance_inflation_factor(X_train.values, i) for i in range(len(X_train.columns))]})
vif_data
Feature | VIF | |
---|---|---|
0 | Aces | 4.543423 |
1 | DoubleFaults | 4.543423 |
This is much better. We can be happy with that and move forward with checking if the model passes the rest of the tests. This time, without the commentary.
y_pred = lm.predict(X_test)
residuals = y_test - y_pred
fig, axes = plt.subplots(2,3, figsize=(15, 10))
fig.subplots_adjust(hspace=0.3, wspace=0.22)
sns.set(font_scale=0.8)
#Distribution of residuals
sns.histplot(residuals, ax=axes[0,0]).set_title('Distribution of Residuals', fontsize=12, weight='bold')
sm.qqplot(residuals, line='r', ax=axes[0,1], markersize=4)
axes[0,1].set_title('QQ Plot', fontsize=12, weight='bold')
#Homoscedasticity
sns.scatterplot(x=y_test, y=residuals, ax=axes[0,2], alpha=0.4).set_title('Homoscedasticity', fontsize=12, weight='bold')
axes[0,2].set_ylabel('Residuals')
sns.lineplot(x=y_test, y=[0]*len(y_test), ax=axes[0,2], color='r');
#Correlation between independent variables and residuals
sns.scatterplot(x=X_test['Aces'], y=residuals, ax=axes[1,0], alpha=0.4).set_title('Relationship between \nAces and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['Aces'], y=[0]*len(X_test['Aces']), ax=axes[1,0], color='r');
sns.scatterplot(x=X_test['DoubleFaults'], y=residuals, ax=axes[1,1], alpha=0.4).set_title('Relationship between \nDoubleFaults and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['DoubleFaults'], y=[0]*len(X_test['DoubleFaults']), ax=axes[1,1], color='r');
#Autocorrelation in residuals
sns.lineplot(x=residuals.index, y=residuals, ax=axes[1,2]).set_title('Autocorrelation of Residuals', fontsize=12, weight='bold');
axes[1,2].set_ylabel('Residuals')
subplts = [axes[0,1], axes[0,2], axes[1,0], axes[1,1], axes[1,2]]
colnames = X_test.columns
for i in range(len(subplts)):
    subplts[i].set_yticks([-400000, -200000, 0, 200000, 400000],['-400K', '-200K', '0', '200K', '400K'])
Sometimes fixing multicollinearity fixes other issues. This time it did not.
Our next clue for fixing the model is the distribution of the residuals in the histogram and the QQ plot. It appears as if the most extreme values in the residuals throw the model a bit overboard. Recall from the boxplots of our initial EDA that there are some extreme values in both Aces and DoubleFaults.
The problem is that scikit-learn's LinearRegression does not have many built-in functions to diagnose the influence of outliers, nor other summary tables. We can use statsmodels for this, although we have to re-create our linear regression model.
X_model = sm.add_constant(X_train)
sm_model = sm.OLS(y_train, X_model).fit()
sm_model.summary()
Dep. Variable: | Winnings | R-squared: | 0.813 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.812 |
Method: | Least Squares | F-statistic: | 659.2 |
Date: | Tue, 11 Oct 2022 | Prob (F-statistic): | 4.34e-111 |
Time: | 18:07:23 | Log-Likelihood: | -3963.6 |
No. Observations: | 306 | AIC: | 7933. |
Df Residuals: | 303 | BIC: | 7944. |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 4.058e+04 | 6876.396 | 5.901 | 0.000 | 2.7e+04 | 5.41e+04 |
Aces | 536.2989 | 83.237 | 6.443 | 0.000 | 372.504 | 700.094 |
DoubleFaults | 3173.6079 | 243.427 | 13.037 | 0.000 | 2694.586 | 3652.630 |
Omnibus: | 143.150 | Durbin-Watson: | 1.977 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 1056.680 |
Skew: | 1.763 | Prob(JB): | 3.51e-230 |
Kurtosis: | 11.393 | Cond. No. | 185. |
We can check that this model effectively corresponds to the one we created using scikit-learn by comparing the coefficients and intercept (const) obtained with statsmodels against the ones from our previous model, so we can then use statsmodels to diagnose what we built using scikit-learn:
print(
'Aces coefficient:\t\t', lm.coef_[0], '\n'
'DoubleFaults coefficient:\t', lm.coef_[1], '\n'
'Intercept:\t\t\t', lm.intercept_
)
Aces coefficient:          536.2988660480429
DoubleFaults coefficient:  3173.6078643403675
Intercept:                 40577.862306813855
A popular method to evaluate the influence of each data point on a linear regression model (including likely outliers) is what's known as Cook's Distance, which measures the effect on the model of removing that observation. The higher the Cook's Distance, the higher the influence (leverage) of the data point. Importantly, the size of a residual is not necessarily correlated with the size of its influence!
cooks_dists = sm_model.get_influence().cooks_distance
fig, axes = plt.subplots(1,2, figsize=(10,5))
sns.scatterplot(x=X_model['Aces'], y=cooks_dists[0], alpha=0.4, ax=axes[0]).set_title('Cook\'s Distance for Aces', fontsize=12, weight='bold');
sns.scatterplot(x=X_model['DoubleFaults'], y=cooks_dists[0], alpha=0.4, ax=axes[1]).set_title('Cook\'s Distance for DoubleFaults', fontsize=12, weight='bold');
axes[0].set_ylabel('Distance');
axes[1].set_ylabel('Distance');
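One common rule of thumb (not the approach followed in the rest of this write-up, just a quick sketch) flags observations whose Cook's Distance exceeds 4/n, where n is the number of observations used to fit the model:
import numpy as np
cooks_threshold = 4 / len(X_train)  # rule-of-thumb cut-off for Cook's Distance
flagged = np.sum(cooks_dists[0] > cooks_threshold)
print(f'{flagged} observations exceed the 4/n threshold of {cooks_threshold:.4f}')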
There are a couple of ways to check leverage. The first one is visually plotting the values of the hat matrix diagonal against the studentised residuals (residuals divided by an estimate of their standard deviation). An observation with an absolute studentised residual higher than 3 is usually considered abnormally high, although some would argue this is not stringent enough.
hat_diag = sm_model.get_influence().hat_matrix_diag
studentised_residuals = sm_model.get_influence().resid_studentized_external
fig = plt.figure(figsize=(5,5))
sns.regplot(x=hat_diag, y=studentised_residuals, fit_reg=False, scatter_kws={'alpha':0.4, 's':15}).set_title('Leverage v Studentised Residuals', fontsize=12, weight='bold')
plt.plot(hat_diag, [3]*len(hat_diag), c='r', alpha=0.4)
plt.plot(hat_diag, [-3]*len(hat_diag), c='r', alpha=0.4)
plt.ylabel('Studentised Residuals', fontsize=10)
plt.xlabel('Leverage', fontsize=10);
We can check these results in a tabular fashion:
studentised_residuals = sm_model.get_influence().resid_studentized_external
df_st_resid = pd.DataFrame({'Index (Obs)': X_train.index, 'Stud Resid': studentised_residuals})
# IMPORTANT: If you don't specify the index from the training set as a column (Index (Obs)),
# the index displayed in the results will correspond to the index of the `.resid_studentized_external` object, and NOT
# to the index of the original observation that was used to build the model, and you will wonder, for example,
# why the results show an index that doesn't reflect the original dataset, or why some plots show an index that doesn't correspond
# to the size of the dataset used to create the model (see the influence plot a couple of paragraphs below).
outliers = df_st_resid[abs(df_st_resid['Stud Resid']) > 3]
outliers.sort_values(by='Index (Obs)')
Index (Obs) | Stud Resid | |
---|---|---|
206 | 107 | -4.821700 |
281 | 123 | 3.664859 |
250 | 313 | 4.512081 |
35 | 325 | 5.186609 |
107 | 555 | 5.671448 |
262 | 559 | 4.218032 |
10 | 892 | -3.409213 |
88 | 982 | 3.000684 |
235 | 1103 | 4.171207 |
Overall, having only 9 problematic points (for now) in a training set of 306 observations is outstandingly good, although -- again -- some people would argue that a cut-off point of 3 is too low a bar. This can be fine-tuned later if required.
The second way in which we can check for leverage is applying the leverage cut-off point given by the formula:
$$\frac{2k + 2}{n}$$
Where k is the number of predictors and n is the number of observations in the model. We can then use this cut-off to identify the points that have too much leverage:
leverage_cutoff = ((2*len(lm.coef_))+2)/len(X_train) #about 0.0196 for this model
df_influence = pd.DataFrame({'Index (Obs)': X_train.index, 'Influence': hat_diag})
# IMPORTANT: Again, if you don't specify the index from the training set as a column (Index Obs),
# the index displayed in the results will correspond to the index of the hat matrix, and NOT to the index of the original
# observation that was used to build the model.
high_influence = df_influence[(df_influence['Influence'] > leverage_cutoff)]
high_influence.sort_values(by='Influence', ascending=False)
Index (Obs) | Influence | |
---|---|---|
206 | 107 | 0.404833 |
55 | 49 | 0.153951 |
90 | 1573 | 0.129802 |
222 | 336 | 0.110273 |
58 | 63 | 0.102671 |
18 | 561 | 0.093042 |
62 | 705 | 0.077757 |
19 | 927 | 0.055980 |
46 | 725 | 0.054473 |
24 | 221 | 0.052040 |
51 | 361 | 0.038056 |
215 | 430 | 0.036055 |
187 | 732 | 0.034388 |
195 | 562 | 0.033957 |
235 | 1103 | 0.032844 |
137 | 101 | 0.032379 |
139 | 1005 | 0.029607 |
122 | 531 | 0.028054 |
40 | 612 | 0.027748 |
303 | 924 | 0.024581 |
37 | 388 | 0.023643 |
250 | 313 | 0.023528 |
79 | 554 | 0.021527 |
This time 23 observations were flagged, and 3 are repeats from the previous method: Indices 107, 313, and 1103.
So now we are wondering: can we choose one method over the other to determine influence? The answer is no. Since each method focuses on a different aspect of the observations, they diagnose different things about the points, as you can see from the differences between the two sets of results above. Let me explain.
We applied the studentised residual cut-off threshold of 3 along the "Studentised Residuals" axis of the data in order to determine outliers. The leverage cut-off threshold, on the other hand, is applied along the "Hat Matrix Diagonal" (leverage) axis. Check again the "Leverage v Studentised Residuals" graph we created a few cells ago, and you can see why this makes sense: having a high leverage value does not necessarily go hand in hand with having a high studentised residual.
It might be argued that being an outlier automatically qualifies a point as influential, but we need to remember we are applying two different criteria over two different dimensions of the data. Additionally, although outliers commonly have high influence (hence the argument), this is not necessarily the case every time for all datasets. It is always good practice to check both the outliers and the leverage of each point to see if the points with the largest influence are also the ones flagged as outliers. This makes more sense if we plot an additional cut-off line on our graph, so we can see the "quadrants" where outliers have high or low leverage, and which points have high leverage without being outliers.
fig = plt.figure(figsize=(5,5))
sns.regplot(x=hat_diag, y=studentised_residuals, fit_reg=False, scatter_kws={'alpha':0.4, 's':15}).set_title('Leverage v Studentised Residuals', fontsize=12, weight='bold')
plt.plot(hat_diag, [3]*len(hat_diag), c='r', alpha=0.4)
plt.plot(hat_diag, [-3]*len(hat_diag), c='r', alpha=0.4)
plt.vlines(leverage_cutoff, ymin=-5, ymax=6, color='r', alpha=0.4)
plt.ylabel('Studentised Residuals', fontsize=10)
plt.xlabel('Leverage', fontsize=10);
Now that we have seen and understood all of this, we can visualise influence with statsmodels' influence plot, which conveniently labels some of the most influential points with the index of the original observation.
fig, ax = plt.subplots(figsize=(8,8));
sm.graphics.influence_plot(sm_model, criterion='cooks', ax=ax, alpha=0.3, );
We can see that, effectively, observation 107 is by far the most influential.
Now, it is tempting to just go and remove every data point that doesn't comply with our stipulations because that's what we've been taught to do, but it is important to remember that outliers are not always simply "wrong" data points resulting from typos or machine malfunction: they can be real observations of the world and thus convey important information about the reality we are trying to model. This is especially true in human endeavours such as sports, earnings, hours worked in a week, and more, in which some people excel in scoring points, earning money, or working hours way beyond the vast majority of the population. This also calls into question whether certain datasets can be described by the chosen model (linear, logistic, etc.), and how accurate that description is.
Deleting these outliers is basically taking information away from the dataset, and thus the model will not take that information into consideration, for better or for worse. It is important to pause and think about what we want to model, and to disclose any caveats we have introduced when creating a model, to make sure whoever uses it understands its limitations (for example, the range of income or hours worked within which the model is valid). In some cases the extreme observations are exactly what we want to predict, in which case we should not remove them from the dataset.
What one person thinks is worth removing might be something worth keeping for someone else, and the trade-offs between the model's accuracy and the data represented by the model are for the modeller to consider. So the following will be an evaluation of the model without the points that we have so far identified as outliers or influential, which might not necessarily be what's ultimately best for the model.
#Getting the indices to be removed
outliers_indices = outliers['Index (Obs)'].to_list()
high_influence_indices = high_influence['Index (Obs)'].to_list()
to_remove = list(set(outliers_indices + high_influence_indices))
#Removing these observations
for i in to_remove:
    X_train = X_train.drop(i)
    y_train = y_train.drop(i)
lm.fit(X_train, y_train);
y_pred = lm.predict(X_test)
residuals = y_test - y_pred
X_model = sm.add_constant(X_train)
sm_model = sm.OLS(y_train, X_model).fit()
sm_model.summary()
Dep. Variable: | Winnings | R-squared: | 0.820 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.819 |
Method: | Least Squares | F-statistic: | 623.9 |
Date: | Tue, 11 Oct 2022 | Prob (F-statistic): | 9.78e-103 |
Time: | 18:07:24 | Log-Likelihood: | -3442.8 |
No. Observations: | 277 | AIC: | 6892. |
Df Residuals: | 274 | BIC: | 6902. |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 2.735e+04 | 4470.318 | 6.118 | 0.000 | 1.86e+04 | 3.62e+04 |
Aces | 1007.3281 | 149.297 | 6.747 | 0.000 | 713.413 | 1301.243 |
DoubleFaults | 2738.2429 | 378.121 | 7.242 | 0.000 | 1993.851 | 3482.635 |
Omnibus: | 39.524 | Durbin-Watson: | 1.920 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 208.644 |
Skew: | 0.373 | Prob(JB): | 4.94e-46 |
Kurtosis: | 7.186 | Cond. No. | 94.0 |
fig, axes = plt.subplots(2,3, figsize=(15, 10))
fig.subplots_adjust(hspace=0.3, wspace=0.22)
sns.set(font_scale=0.8)
#Distribution of residuals
sns.histplot(residuals, ax=axes[0,0]).set_title('Distribution of Residuals', fontsize=12, weight='bold')
sm.qqplot(residuals, line='r', ax=axes[0,1], markersize=4)
axes[0,1].set_title('QQ Plot', fontsize=12, weight='bold')
#Homoscedasticity
sns.scatterplot(x=y_test, y=residuals, ax=axes[0,2], alpha=0.4).set_title('Homoscedasticity', fontsize=12, weight='bold')
axes[0,2].set_ylabel('Residuals')
sns.lineplot(x=y_test, y=[0]*len(y_test), ax=axes[0,2], color='r');
#Correlation between independent variables and residuals
sns.scatterplot(x=X_test['Aces'], y=residuals, ax=axes[1,0], alpha=0.4).set_title('Relationship between \nAces and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['Aces'], y=[0]*len(X_test['Aces']), ax=axes[1,0], color='r');
sns.scatterplot(x=X_test['DoubleFaults'], y=residuals, ax=axes[1,1], alpha=0.4).set_title('Relationship between \nDoubleFaults and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['DoubleFaults'], y=[0]*len(X_test['DoubleFaults']), ax=axes[1,1], color='r');
#Autocorrelation in residuals
sns.lineplot(x=residuals.index, y=residuals, ax=axes[1,2]).set_title('Autocorrelation of Residuals', fontsize=12, weight='bold');
axes[1,2].set_ylabel('Residuals')
subplts = [axes[0,1], axes[0,2], axes[1,0], axes[1,1], axes[1,2]]
colnames = X_test.columns
for i in range(len(subplts)):
    subplts[i].set_yticks([-400000, -200000, 0, 200000, 400000],['-400K', '-200K', '0', '200K', '400K'])
The model's R2 did not improve by much, and the model still does not do too well on the unknown data, likely because the unknown data (X_test) also contains outliers and highly influential points (remember we only detected them in, and removed them from, the portion of the data we originally used to train the model):
fig, axes = plt.subplots(1,2, figsize=(10,5))
sns.scatterplot(x=X_test['Aces'], y=y_test, alpha=0.4, ax=axes[0]).set_title('Aces and Winnings\nin Test Dataset', fontsize=12, weight='bold');
sns.scatterplot(x=X_test['DoubleFaults'], y=y_test, alpha=0.4, ax=axes[1]).set_title('DoubleFaults and Winnings\nin Test Dataset', fontsize=12, weight='bold');
The last iteration of this exercise will take one step back to try to fix the model by identifying and removing the outliers and highly influential points from the whole dataset. This time the outlier detection cut-off point will be more aggressive (2 instead of 3).
X_model = sm.add_constant(X)
sm_model = sm.OLS(y, X_model).fit()
hat_diag = sm_model.get_influence().hat_matrix_diag
studentised_residuals = sm_model.get_influence().resid_studentized_external
df_st_resid = pd.DataFrame({'Index (Obs)': X.index, 'Stud Resid': studentised_residuals})
outliers = df_st_resid[abs(df_st_resid['Stud Resid']) > 2]
outliers.sort_values(by='Index (Obs)')
Index (Obs) | Stud Resid | |
---|---|---|
44 | 56 | 2.739897 |
52 | 71 | 2.768442 |
74 | 101 | 2.275342 |
76 | 107 | -4.931413 |
86 | 123 | 3.636454 |
132 | 221 | -2.549095 |
160 | 288 | 2.098548 |
161 | 289 | 4.115735 |
166 | 308 | 5.263617 |
167 | 313 | 4.584519 |
171 | 325 | 5.115658 |
205 | 403 | -4.355229 |
216 | 432 | 2.662136 |
228 | 483 | 2.173804 |
248 | 555 | 5.619228 |
251 | 559 | 4.238278 |
252 | 561 | -2.332369 |
269 | 624 | 2.929817 |
281 | 672 | -4.376502 |
284 | 696 | 2.852104 |
295 | 732 | 2.419758 |
296 | 735 | -2.129421 |
320 | 892 | -3.360578 |
331 | 982 | 3.059259 |
344 | 1076 | 2.993149 |
347 | 1103 | 4.270050 |
403 | 1473 | -2.579514 |
leverage_cutoff = ((2*len(lm.coef_))+2)/len(X) #about 0.0137 this time
df_influence = pd.DataFrame({'Index (Obs)': X.index, 'Influence': hat_diag})
high_influence = df_influence[(df_influence['Influence'] > leverage_cutoff)]
high_influence.sort_values(by='Influence', ascending=False)
Index (Obs) | Influence | |
---|---|---|
76 | 107 | 0.327400 |
281 | 672 | 0.144068 |
40 | 49 | 0.127635 |
176 | 336 | 0.089514 |
415 | 1573 | 0.081889 |
285 | 705 | 0.065956 |
48 | 63 | 0.062699 |
252 | 561 | 0.057447 |
284 | 696 | 0.048018 |
342 | 1068 | 0.044198 |
324 | 927 | 0.034569 |
132 | 221 | 0.034551 |
293 | 725 | 0.034326 |
158 | 284 | 0.032214 |
295 | 732 | 0.028805 |
403 | 1473 | 0.028789 |
253 | 562 | 0.026947 |
166 | 308 | 0.026098 |
74 | 101 | 0.026080 |
265 | 612 | 0.023696 |
188 | 361 | 0.023148 |
214 | 430 | 0.022738 |
205 | 403 | 0.021963 |
333 | 1005 | 0.021381 |
296 | 735 | 0.020666 |
347 | 1103 | 0.020288 |
64 | 85 | 0.019682 |
323 | 924 | 0.018797 |
242 | 531 | 0.018779 |
136 | 231 | 0.018441 |
247 | 554 | 0.018314 |
199 | 388 | 0.016137 |
370 | 1222 | 0.015783 |
304 | 793 | 0.015362 |
167 | 313 | 0.014513 |
#Data
X = tennis_data[['Aces', 'DoubleFaults']]
y = tennis_data['Winnings']
#Getting the indices to be removed
outliers_indices = outliers['Index (Obs)'].to_list()
high_influence_indices = high_influence['Index (Obs)'].to_list()
to_remove = list(set(outliers_indices + high_influence_indices))
#Removing these observations
for i in to_remove:
    X = X.drop(i)
    y = y.drop(i)
#Split data, train, and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state=100)
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
residuals = y_test - y_pred
#Visualise and Diagnose
fig, axes = plt.subplots(2,3, figsize=(15, 10))
fig.subplots_adjust(hspace=0.3, wspace=0.22)
sns.set(font_scale=0.8)
#Distribution of residuals
sns.histplot(residuals, ax=axes[0,0]).set_title('Distribution of Residuals', fontsize=12, weight='bold')
axes[0,0].set_xticks([-150000, -100000, -50000, 0, 50000, 100000, 150000],['-150K', '-100K', '-50K', '0', '50K', '100K', '150K'])
sm.qqplot(residuals, line='r', ax=axes[0,1], markersize=4)
axes[0,1].set_title('QQ Plot', fontsize=12, weight='bold')
#Homoscedasticity
sns.scatterplot(x=y_test, y=residuals, ax=axes[0,2], alpha=0.4).set_title('Homoscedasticity', fontsize=12, weight='bold')
axes[0,2].set_ylabel('Residuals')
sns.lineplot(x=y_test, y=[0]*len(y_test), ax=axes[0,2], color='r');
#Correlation between independent variables and residuals
sns.scatterplot(x=X_test['Aces'], y=residuals, ax=axes[1,0], alpha=0.4).set_title('Relationship between \nAces and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['Aces'], y=[0]*len(X_test['Aces']), ax=axes[1,0], color='r');
sns.scatterplot(x=X_test['DoubleFaults'], y=residuals, ax=axes[1,1], alpha=0.4).set_title('Relationship between \nDoubleFaults and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['DoubleFaults'], y=[0]*len(X_test['DoubleFaults']), ax=axes[1,1], color='r');
#Autocorrelation in residuals
sns.lineplot(x=residuals.index, y=residuals, ax=axes[1,2]).set_title('Autocorrelation of Residuals', fontsize=12, weight='bold');
axes[1,2].set_ylabel('Residuals')
subplts = [axes[0,1], axes[0,2], axes[1,0], axes[1,1], axes[1,2]]
colnames = X_test.columns
for i in range(len(subplts)):
    subplts[i].set_yticks([-400000, -200000, 0, 200000, 400000],['-400K', '-200K', '0', '200K', '400K'])
#Diagnose
X_model = sm.add_constant(X_train)
sm_model = sm.OLS(y_train, X_model).fit()
sm_model.summary()
Dep. Variable: | Winnings | R-squared: | 0.823 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.821 |
Method: | Least Squares | F-statistic: | 623.4 |
Date: | Tue, 11 Oct 2022 | Prob (F-statistic): | 1.01e-101 |
Time: | 18:07:25 | Log-Likelihood: | -3319.1 |
No. Observations: | 272 | AIC: | 6644. |
Df Residuals: | 269 | BIC: | 6655. |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 2.669e+04 | 3551.116 | 7.515 | 0.000 | 1.97e+04 | 3.37e+04 |
Aces | 492.2228 | 122.894 | 4.005 | 0.000 | 250.267 | 734.179 |
DoubleFaults | 3421.4952 | 291.065 | 11.755 | 0.000 | 2848.440 | 3994.551 |
Omnibus: | 18.608 | Durbin-Watson: | 1.989 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 51.668 |
Skew: | 0.188 | Prob(JB): | 6.03e-12 |
Kurtosis: | 5.102 | Cond. No. | 80.7 |
fig = plt.figure(figsize=(5,5))
sns.regplot(x=y_test, y=y_pred, scatter_kws={'alpha':0.4, 's':15}).set_title('Actual v Predicted Winnings', fontsize=12, weight='bold')
plt.ylabel('Predicted Winnings', fontsize=10)
plt.xlabel('Actual Winnings', fontsize=10);
The distribution of the residuals improved greatly; however, we still have heteroscedasticity and other issues.
We could keep iterating on this model by removing data, increasing our score (R2 value), and trying to fix the other issues, but we would be removing more and more data just to make the model work better. With my current level of stringency, I have already removed 49 out of 438 observations (more than 10%!). What if this is as good as the model will get with this dataset? We can't just eliminate half of the dataset to get a perfect score. Alternatively, we could go back and delete only the outliers but not the highly influential points. There are many options we could explore, but this exercise is already long as it is.
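As a side note, the R2 reported in the statsmodels summary above is computed on the training split; if we wanted the equivalent figure on the held-out test data, scikit-learn's .score() method gives it directly (a quick sketch; output not reproduced here):
print(lm.score(X_test, y_test))  # R-squared of the final sklearn model on the held-out test split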
An R2 value of 0.82 can be considered very good given the constraints discussed above.
Additionally, it is important to remember that we have chosen (for the sake of the exercise) to fit a linear regression model. It is possible that other models might fit the data better, but exploring those options escapes the scope of this exercise.
So far, we can be happy with this model. What do you think?
Thanks for reading!