Multiple Linear Regression Using ATP Men's League Data¶

Overview¶

This project uses data from the Association of Tennis Professionals (ATP) men's league from 2009 to 2017 to build a multiple linear regression model and understand which features are linearly correlated with success as a male tennis player. Rather than offering rare and valuable insight into tennis, the purpose of this project is to build and evaluate a multiple linear regression model for pedagogic and display purposes only.

The data represents the top 1500 players within the aforementioned period. Credit for the data goes to Codecademy. Beyond ID data, the dataset includes 'offensive' (service game) statistics, 'defensive' (return game) statistics, and outcomes. A description of each column in the dataset was provided:

Identifying Data:¶

-Player: name of the tennis player.

-Year: year data was recorded.

"Offensive" (Service Game) Columns:¶

-Aces: number of serves by the player where the receiver does not touch the ball

-DoubleFaults: number of times player missed both first and second serve attempts

-FirstServe: % of first-serve attempts made

-FirstServePointsWon: % of first-serve attempt points won by the player

-SecondServePointsWon: % of second-serve attempt points won by the player

-BreakPointsFaced: number of times where the receiver could have won the service game of the player

-BreakPointsSaved: % of the time the player was able to stop the receiver from winning the service game when they had the chance

-ServiceGamesPlayed: total number of games where the player served

-ServiceGamesWon: total number of games where the player served and won

-TotalServicePointsWon: % of points in games where the player served that they won

"Defensive" (Return Game) Columns:¶

-FirstServeReturnPointsWon: % of opponent's first-serve points the player was able to win

-SecondServeReturnPointsWon: % of opponent's second-serve points the player was able to win

-BreakPointsOpportunities: number of times where the player could have won the service game of the opponent

-BreakPointsConverted: % of the time the player was able to win their opponent’s service game when they had the chance

-ReturnGamesPlayed: total number of games where the player’s opponent served

-ReturnGamesWon: total number of games where the player’s opponent served and the player won

-ReturnPointsWon: total number of points where the player’s opponent served and the player won

-TotalPointsWon: % of points won by the player

Outcomes:¶

-Wins: number of matches won in a year

-Losses: number of matches lost in a year

-Winnings: total winnings in USD($) in a year

-Ranking: ranking at the end of the year


This pipeline uses seaborn, pandas, matplotlib, statsmodels and sklearn. If there is something you think could be improved, or done differently, feel free to reach out!

Step 1 - Initial Exploratory Data Analysis¶

We'll start by importing all the required libraries, the dataset, and performing an initial exploration of the dataset to check for missing values, issues with data types, etc.

In [ ]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
In [ ]:
tennis_data = pd.read_csv('tennis_stats.csv')
tennis_data.head()
Out[ ]:
Player Year FirstServe FirstServePointsWon FirstServeReturnPointsWon SecondServePointsWon SecondServeReturnPointsWon Aces BreakPointsConverted BreakPointsFaced ... ReturnGamesWon ReturnPointsWon ServiceGamesPlayed ServiceGamesWon TotalPointsWon TotalServicePointsWon Wins Losses Winnings Ranking
0 Pedro Sousa 2016 0.88 0.50 0.38 0.50 0.39 0 0.14 7 ... 0.11 0.38 8 0.50 0.43 0.50 1 2 39820 119
1 Roman Safiullin 2017 0.84 0.62 0.26 0.33 0.07 7 0.00 7 ... 0.00 0.20 9 0.67 0.41 0.57 0 1 17334 381
2 Pedro Sousa 2017 0.83 0.60 0.28 0.53 0.44 2 0.38 10 ... 0.16 0.34 17 0.65 0.45 0.59 4 1 109827 119
3 Rogerio Dutra Silva 2010 0.83 0.64 0.34 0.59 0.33 2 0.33 5 ... 0.14 0.34 15 0.80 0.49 0.63 0 0 9761 125
4 Daniel Gimeno-Traver 2017 0.81 0.54 0.00 0.33 0.33 1 0.00 2 ... 0.00 0.20 2 0.50 0.35 0.50 0 1 32879 272

5 rows × 24 columns

In [ ]:
tennis_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1721 entries, 0 to 1720
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Player                      1721 non-null   object 
 1   Year                        1721 non-null   int64  
 2   FirstServe                  1721 non-null   float64
 3   FirstServePointsWon         1721 non-null   float64
 4   FirstServeReturnPointsWon   1721 non-null   float64
 5   SecondServePointsWon        1721 non-null   float64
 6   SecondServeReturnPointsWon  1721 non-null   float64
 7   Aces                        1721 non-null   int64  
 8   BreakPointsConverted        1721 non-null   float64
 9   BreakPointsFaced            1721 non-null   int64  
 10  BreakPointsOpportunities    1721 non-null   int64  
 11  BreakPointsSaved            1721 non-null   float64
 12  DoubleFaults                1721 non-null   int64  
 13  ReturnGamesPlayed           1721 non-null   int64  
 14  ReturnGamesWon              1721 non-null   float64
 15  ReturnPointsWon             1721 non-null   float64
 16  ServiceGamesPlayed          1721 non-null   int64  
 17  ServiceGamesWon             1721 non-null   float64
 18  TotalPointsWon              1721 non-null   float64
 19  TotalServicePointsWon       1721 non-null   float64
 20  Wins                        1721 non-null   int64  
 21  Losses                      1721 non-null   int64  
 22  Winnings                    1721 non-null   int64  
 23  Ranking                     1721 non-null   int64  
dtypes: float64(12), int64(11), object(1)
memory usage: 322.8+ KB
In [ ]:
tennis_data.isna().sum()
Out[ ]:
Player                        0
Year                          0
FirstServe                    0
FirstServePointsWon           0
FirstServeReturnPointsWon     0
SecondServePointsWon          0
SecondServeReturnPointsWon    0
Aces                          0
BreakPointsConverted          0
BreakPointsFaced              0
BreakPointsOpportunities      0
BreakPointsSaved              0
DoubleFaults                  0
ReturnGamesPlayed             0
ReturnGamesWon                0
ReturnPointsWon               0
ServiceGamesPlayed            0
ServiceGamesWon               0
TotalPointsWon                0
TotalServicePointsWon         0
Wins                          0
Losses                        0
Winnings                      0
Ranking                       0
dtype: int64
In [ ]:
tennis_data['Player'].value_counts()
Out[ ]:
Ivan Dodig         9
Leonardo Mayer     9
Dudi Sela          9
Evgeny Donskoy     9
Go Soeda           9
                  ..
Gleb Sakharov      1
Vaclav Safranek    1
Alex de Minaur     1
Takao Suzuki       1
Kento Takeuchi     1
Name: Player, Length: 438, dtype: int64
In [ ]:
tennis_data.describe()
Out[ ]:
Year FirstServe FirstServePointsWon FirstServeReturnPointsWon SecondServePointsWon SecondServeReturnPointsWon Aces BreakPointsConverted BreakPointsFaced BreakPointsOpportunities ... ReturnGamesWon ReturnPointsWon ServiceGamesPlayed ServiceGamesWon TotalPointsWon TotalServicePointsWon Wins Losses Winnings Ranking
count 1721.000000 1721.000000 1721.000000 1721.000000 1721.000000 1721.000000 1721.000000 1721.000000 1721.000000 1721.000000 ... 1721.000000 1721.000000 1721.000000 1721.000000 1721.000000 1721.000000 1721.000000 1721.000000 1.721000e+03 1721.000000
mean 2013.646717 0.598053 0.680738 0.261673 0.479733 0.466432 97.105171 0.369407 112.003486 102.918071 ... 0.173823 0.342208 197.650203 0.715590 0.473155 0.599245 7.876816 9.278908 2.344928e+05 269.610691
std 2.488018 0.054533 0.070422 0.056639 0.066902 0.068447 137.966077 0.162987 119.247651 122.761670 ... 0.080880 0.049369 221.208703 0.123287 0.037139 0.057718 10.183716 8.996450 2.530537e+05 277.341947
min 2009.000000 0.360000 0.270000 0.000000 0.060000 0.000000 0.000000 0.000000 1.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.220000 0.250000 0.000000 0.000000 1.080000e+02 3.000000
25% 2012.000000 0.570000 0.650000 0.240000 0.460000 0.440000 7.000000 0.320000 15.000000 9.000000 ... 0.130000 0.320000 22.000000 0.670000 0.460000 0.570000 0.000000 2.000000 4.931100e+04 83.000000
50% 2014.000000 0.600000 0.690000 0.270000 0.490000 0.480000 34.000000 0.380000 55.000000 41.000000 ... 0.180000 0.350000 86.000000 0.750000 0.480000 0.610000 3.000000 5.000000 1.252120e+05 166.000000
75% 2016.000000 0.630000 0.720000 0.290000 0.520000 0.500000 140.000000 0.430000 201.000000 172.000000 ... 0.220000 0.370000 348.000000 0.790000 0.500000 0.630000 13.000000 17.000000 3.500750e+05 333.000000
max 2017.000000 0.880000 0.890000 0.480000 0.920000 0.750000 1185.000000 1.000000 507.000000 573.000000 ... 0.560000 0.510000 916.000000 1.000000 0.820000 0.820000 48.000000 36.000000 1.074562e+06 1443.000000

8 rows × 23 columns

It looks like the dataset has no missing entries and that the dtypes correspond to what we expected. Additionally, we have a good number of players represented and a good range of rankings. However, many players are 'repeats' in the data since they frequently feature in the highest rankings across several years. Since linear regression relies on independent observations, we can't work with players who have more than one entry. So before doing anything else, we'll drop the duplicates based on the column Player.

In [ ]:
tennis_data = tennis_data.drop_duplicates(subset='Player')

The first thing we'll now do is perform exploratory analysis via plotting. Considering that we're interested in "success" for a tennis player, it would be useful to see what features (variables) correlate strongly with outcomes. There are four outcome columns in the dataset: Wins, Losses, Winnings and Ranking. Since Winnings and Ranking are basically a function of Wins and Losses (at least logically), we'll focus on Wins and Losses to see what predicts (i.e. correlates with) these outcomes. We'll thus explore which variables from the "Offensive" and "Defensive" categories correlate strongly with them. We'll also plot the distribution of the independent variables while we're at it, and check the outlier situation for each variable.

Let's start with the "Offensive" variables.

In [ ]:
offensive_vars = tennis_data.columns[[7,12,2,3,5,9,11,16,17,19]]
fig, axes = plt.subplots(len(offensive_vars),4, figsize=(25, 60))
fig.subplots_adjust(hspace=0.3, wspace=0.2)
sns.set(font_scale=1.3)

for var in offensive_vars:
    sns.scatterplot(x=var, y='Wins', data=tennis_data, ax=axes[offensive_vars.tolist().index(var),0], alpha=0.4).set_title(f'{var} vs Wins', fontsize=18, weight='bold')
    sns.scatterplot(x=var, y='Losses', data=tennis_data, ax=axes[offensive_vars.tolist().index(var),1], alpha=0.4).set_title(f'{var} vs Losses', fontsize=18, weight='bold')
    sns.histplot(x=var, data=tennis_data, ax=axes[offensive_vars.tolist().index(var),2]).set_title('Distribution', fontsize=18, weight='bold')
    sns.boxplot(x=var, data=tennis_data, ax=axes[offensive_vars.tolist().index(var),3]).set_title('Outliers', fontsize=18, weight='bold')

We'll do the same for the "Defensive" variables now.

In [ ]:
defensive_vars = tennis_data.columns[[4,6,10,8,13,14,15,18]]
fig, axes = plt.subplots(len(defensive_vars),4, figsize=(25, 50))
fig.subplots_adjust(hspace=0.3, wspace=0.2)
sns.set(font_scale=1.3)

for var in defensive_vars:
    sns.scatterplot(x=var, y='Wins', data=tennis_data, ax=axes[defensive_vars.tolist().index(var),0], alpha=0.4).set_title(f'{var} vs Wins', fontsize=18, weight='bold')
    sns.scatterplot(x=var, y='Losses', data=tennis_data, ax=axes[defensive_vars.tolist().index(var),1], alpha=0.4).set_title(f'{var} vs Losses', fontsize=18, weight='bold')
    sns.histplot(x=var, data=tennis_data, ax=axes[defensive_vars.tolist().index(var),2]).set_title('Distribution', fontsize=18, weight='bold')
    sns.boxplot(x=var, data=tennis_data, ax=axes[defensive_vars.tolist().index(var),3]).set_title('Outliers', fontsize=18, weight='bold')

This gives us a nice overview of the shape of the variables and their relationship to our two most important outcomes: Wins and Losses.

From a quick glance at these figures, we can see that there are four independent variables that show a good linear correlation with both outcomes of interest: two "offensive" variables (BreakPointsFaced and ServiceGamesPlayed) and two "defensive" variables (BreakPointsOpportunities and ReturnGamesPlayed), with BreakPointsFaced and Wins showing the weakest association of the lot. They also have very few outliers.

But we have a big problem: these variables correlate positively with both Wins and Losses, meaning that they are useless to understand what we should focus on (if we were tennis players) if we wanted to maximise our wins while avoiding losses.

Additionally, if we rehash from the description of these variables:

-BreakPointsFaced: number of times where the receiver could have won service game of the player.

-ServiceGamesPlayed: total number of games where the player served.

-BreakPointsOpportunities: number of times where the player could have won the service game of the opponent.

-ReturnGamesPlayed: total number of games where the player’s opponent served.

It appears as if the relationship between these variables and the chosen outcomes becomes a sort of "the more you play, the more you can either win or lose" (which is basically useless insight), and that we might have some collinearity among these features when we evaluate the model (we'll check this later).

Another option is to explore the relationship of the same variables with the outcomes we sidelined earlier (Winnings and Ranking). We'll do this now so we can keep exploring and choose the best features to build our model with, hopefully gaining some interesting insights. This time, however, we'll visualise the relationships between all variables using a correlation matrix displayed as a heatmap.

In [ ]:
data = tennis_data.drop(['Player', 'Year'], axis=1).corr()
labels = round(data,2)

sns.set(font_scale=0.85, rc={'figure.figsize': (12,12)})
sns.heatmap(data, cmap='vlag', center=0, annot=labels).set_title('Tennis Data Linear Relationships Heatmap', weight='bold', fontsize=18)
plt.show()

The .corr() method calculates the correlation of the variables (using Pearson's approach by default) and outputs a correlation matrix that contains Pearson's coefficients for each pair of variables in the dataset. Visualising this correlation matrix as a heatmap is a convenient way to quickly check linear relationships when the dataset contains too many variables and a pairplot would give an output that's too large and cumbersome to read. Additionally, this method lets us see the numerical values of said relationships (so we don't have to guess how strong a relationship is). The downside of this approach is that it will not show other kinds of relationships that scatterplots might make evident, so be careful!
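
Since the heatmap can be hard to scan, a quick complementary check (not part of the original flow) is to rank the features by the absolute value of their correlation with a single outcome. A minimal sketch, reusing the tennis_data DataFrame from above and focusing on Winnings:

In [ ]:
# Rank features by |Pearson's r| with Winnings (the key=abs argument requires pandas >= 1.1)
corr_with_winnings = (
    tennis_data.drop(['Player', 'Year'], axis=1)
    .corr()['Winnings']
    .drop('Winnings')
    .sort_values(key=abs, ascending=False)
)
corr_with_winnings.head(10)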

In the case of this dataset, there appears to be no variable that correlates uniquely with either Wins, Losses, Winnings or Ranking. In fact, the variables that correlate most strongly with Wins also correlate just as strongly with Losses and Winnings, unfortunately. From this perspective, it appears that we will not be getting much insight into what predicts each outcome, and this dataset might simply reflect non-linear relationships at this level of tennis performance.

Nonetheless, we will build and evaluate some linear models to understand the underlying assumptions of the models and common pitfalls to avoid.

Step 2 - Build an Initial Model to Check Performance and Assumptions.¶

Linear regression, although simple to understand and powerful in many cases, relies on several underlying assumptions that need to be met if the model is going to be an accurate representation of the data and if it's going to give us appropriate predictive power. These assumptions are:

-Independence of observations.

-No hidden or missing variables.

-Linear relationship.

-Normality of residuals.

-No or little multicollinearity.

-Homoscedasticity.

-Independent variables are uncorrelated with the error term.

-Observations of error term are uncorrelated with each other.

Let's explore them one by one and use the dataset at hand to check if the model we will build passes the test.

Independence of observations: Simply put, this means that one observation is not related to another in some obvious manner (e.g. entries from the same patient in the case of medical data, or from the same vendor in the case of sales data). In our dataset, the same player can appear in more than one entry because they featured in the top rankings in more than one year. We dealt with this issue earlier when we used the .drop_duplicates() method with the Player column as reference. If the data had no ID column (or had been anonymised for ethical purposes), we would need to check the methodology under which the data was collected to see if this assumption has been met for the dataset at hand.

No hidden or missing variables: Basically, that you have (hopefully) collected all the relevant variables that predict an outcome; otherwise the model will simply not be a good reflection of what predicts the outcome. This assumption is hard to check since it is effectively a question of methodology design rather than of data evaluation, but if you end up with the wrong model -- or at the very least an incomplete one -- even after doing everything right, this is a plausible explanation and a consideration for future research approaches.

Linear relationship: This seems obvious, but it is important to check that the independent variable(s) and the outcome share a linear relationship. If you can see by eye in a scatterplot that the relation is not linear (e.g. it is exponential, polynomial, etc.) and you still apply a linear regression model to the data, you are simply applying the wrong model. This is why the initial exploratory data analysis is critical to see if a linear model might be appropriate (or at least not totally misguided). Have another look at the scatterplots and the heatmap to check which variables share (or do not share) a linear relationship with the outcome variables. As a rule of thumb, a Pearson's coefficient above 0.7 (for positive linear relationships) or below -0.7 (for negative linear relationships) can be considered a strong linear relationship.
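
If you want the coefficient (and an accompanying p-value) for a single pair of variables rather than the full matrix, scipy's pearsonr can be used directly. A small sketch, with the pair chosen purely for illustration:

In [ ]:
from scipy.stats import pearsonr

# Pearson's r and p-value for one illustrative pair of columns
r, p_value = pearsonr(tennis_data['ServiceGamesPlayed'], tennis_data['Wins'])
print(f"Pearson's r: {r:.2f} (p = {p_value:.3g})")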

To check the next set of assumptions we need to actually build a model and predict some values using the model and a validation dataset. As convention, we will use X as the collection of independent variables and y as the outcome (dependent/response) variable.

IMPORTANT NOTE: The initial model will contain all the variables that have a high linear correlation with the outcome Winnings, and will thus violate many assumptions for the sake of the exercise, which we will correct later.

In [ ]:
X = tennis_data[['Aces', 'BreakPointsFaced', 'BreakPointsOpportunities', 'DoubleFaults', 'ReturnGamesPlayed','ServiceGamesPlayed']]
y = tennis_data['Winnings']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state=100)

lm = LinearRegression()
lm.fit(X_train, y_train);

Now let's get back to checking our assumptions.

Normality of residuals: Residuals are the differences between the outcome values predicted by the model for the validation data and the actual outcome values in the validation data. If the residuals are not normally distributed, the model is probably not appropriate for the data at hand (e.g. non-linear relationship, multiple linear relationships rather than a single one, etc.). For this, we first need to calculate the residuals:

In [ ]:
y_pred = lm.predict(X_test) #Predicted outcome values using the model and the test data
residuals = y_test - y_pred

And now we plot them:

In [ ]:
fig = plt.figure(figsize=(5,5))
plt.hist(residuals, bins=20)
plt.title('Distribution of Residuals', fontsize=12, weight='bold')
plt.ylabel('Frequency', fontsize=10)
plt.show()

The distribution doesn't look close to normal: it is heavily concentrated around the centre and has extra weight in the tails. This might indicate that there is a problem with the model. Another way to look at the residuals is what is known as a QQ-plot, which graphs the quantiles of the sample data against the theoretical quantiles the data would have if it followed a perfectly normal distribution.

In [ ]:
import statsmodels.api as sm 

with plt.rc_context():
    plt.rc("figure", figsize=(5,5))
    fig = sm.qqplot(residuals, line='r') #by default compares to normal distribution, but can be changed
    plt.title('QQ-Plot of Residuals', fontsize=12, weight='bold')
plt.show()

A QQ-plot of residuals that follow a normal distribution would show the points fitting the line to a good degree. This is not the case here, further confirming that the model has some issues, as the residuals are not normally distributed. The "weight at the tails" we noted in the previous histogram also shows here as points deviating at both ends of the line. This calls for a review of the model, but we will do that after checking the remaining assumptions.
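
If we want to complement the visual checks with a formal test, scipy offers normality tests we can run on the residuals. This is an optional sketch, not part of the original pipeline; it assumes the residuals computed above:

In [ ]:
from scipy import stats

# Shapiro-Wilk and Jarque-Bera tests; small p-values indicate the residuals are unlikely to be normal
shapiro_stat, shapiro_p = stats.shapiro(residuals)
jb_stat, jb_p = stats.jarque_bera(residuals)
print(f'Shapiro-Wilk p-value: {shapiro_p:.4g}')
print(f'Jarque-Bera p-value:  {jb_p:.4g}')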

No or little multicollinearity: Collinearity is when two independent variables used to predict the outcome are highly linearly correlated with each other. Multicollinearity is when a variable is highly correlated with two or more other independent variables. This is important to check for a number of reasons:

  • More variables predicting an outcome is not always good if they are related to each other, as the explanatory power of one variable cannot be disentangled from that of the other. Put differently, the model cannot tell which variable is more important when predicting the outcome, making its coefficient estimates unreliable. By way of example, if we wanted to predict heart disease status in people and had the variables "height", "weight" and "BMI" in our dataset, it is highly likely that these variables would present (multi)collinearity, since they are related (BMI is effectively a function of height and weight).

  • It introduces noise in the model, as each variable comes with its own error when fitting, so the more variables we have, the higher the total error of the model, which is especially unnecessary when one of the variables could be dropped without losing predictive power.

A good way to check for (multi)collinearity is the Variance Inflation Factor (VIF). The VIF shows "the strength of the linear relationship between the variable and the remaining explanatory variables" and how much the regression coefficient of the variable at hand is 'inflated' due to the collinearity. This also means that the estimate for that variable is affected by the presence of the other variables, which makes the predictions less accurate. It is an error we need to minimise!

The VIF can be calculated using statsmodels:

In [ ]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame({'Feature': X_test.columns, 'VIF': [variance_inflation_factor(X_test.values, i) for i in range(len(X_test.columns))]})
vif_data
Out[ ]:
Feature VIF
0 Aces 16.079797
1 BreakPointsFaced 76.148170
2 BreakPointsOpportunities 43.635450
3 DoubleFaults 16.608583
4 ReturnGamesPlayed 7380.594760
5 ServiceGamesPlayed 7620.490832

This shows that there is (impressive) multicollinearity in our model, understandably: since service and return games alternate within a match, a player's ReturnGamesPlayed and ServiceGamesPlayed are nearly identical and therefore interdependent. We will remove them in our next iterations. It's a similar story with the variables related to Break Points.

Ideally, our model should show VIFs close to 1 for all variables (1 is the minimum possible value and indicates no correlation with the remaining predictors). Values below 5 are generally acceptable, from 5 to 10 is undesirable, and over 10 is unacceptable and suggests that something is fundamentally wrong with the model.
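
For intuition, the VIF of a feature is 1/(1 − R²) of an auxiliary regression of that feature on the remaining features. Here is a minimal sketch of that calculation, reusing X_test from above (the feature is chosen just for illustration; the result may differ slightly from the statsmodels output because variance_inflation_factor runs its auxiliary regression on the columns exactly as passed, without adding an intercept):

In [ ]:
# Manual VIF: regress one feature on the others and compute 1 / (1 - R^2)
feature = 'ServiceGamesPlayed'  # illustrative choice
others = [col for col in X_test.columns if col != feature]

aux_model = LinearRegression().fit(X_test[others], X_test[feature])
r_squared = aux_model.score(X_test[others], X_test[feature])
print(f'Manual VIF for {feature}: {1 / (1 - r_squared):.1f}')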

Homoscedasticity: This means that the variance of the error (the difference between the predicted and actual outcome values) is constant across the range of the outcome. This can be quickly checked with a scatterplot of the residuals against the outcome variable. We should expect the points to show constant variation around the zero line.

In [ ]:
fig = plt.figure(figsize=(5,5))
plt.scatter(y_test, residuals, alpha=0.4, s=15)
plt.plot(y_test, [0]*len(y_test), c='r')
plt.ylabel('Error size')
plt.xlabel('Outcome variable (Winnings, in millions)')
plt.title('Variance in Error of Residuals', fontsize=12, weight='bold')
plt.show()

In this case, the pattern is clearly not homoscedastic (i.e. it is heteroscedastic).
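
As with normality, a formal test can back up the visual impression. A small optional sketch using the Breusch-Pagan test from statsmodels (assuming the residuals and X_test from above):

In [ ]:
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test: a small p-value suggests heteroscedasticity
bp_stat, bp_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, sm.add_constant(X_test))
print(f'Breusch-Pagan p-value: {bp_pvalue:.4g}')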

Independent variables are uncorrelated with the error term: Basically, that there are no glaringly obvious patterns and relationships between the independent variables and the residuals.

In [ ]:
fig, axes = plt.subplots(2,3, figsize=(15, 10))
fig.subplots_adjust(hspace=0.3, wspace=0.22)
sns.set(font_scale=0.8)

subplts = [axes[0,0], axes[0,1], axes[0,2], axes[1,0], axes[1,1], axes[1,2]]
colnames = X_test.columns

for i in range(len(colnames)):
    sns.scatterplot(x=X_test[colnames[i]], y=residuals, ax=subplts[i], alpha=0.4).set_title(f'{colnames[i]} vs \nResiduals (Winnings)', fontsize=12, weight='bold')
    sns.lineplot(x=X_test[colnames[i]], y=[0]*len(X_test[colnames[i]]), ax=subplts[i], color='r')
    subplts[i].set_yticks([-300000, -200000, -100000, 0, 100000, 200000, 300000],['-300K', '-200K', '-100K', '0', '100K', '200K', '300K'])
    subplts[i].set_ylabel('Residuals (Winnings)')

Here we do not see any correlation between the variables and the residuals, but the fact that the same pattern repeats across all variables is probably another indication of the collinearity we observed numerically with the VIF.

Observations of error term are uncorrelated with each other: Basically, that there are no patterns in the residuals themselves (similar to the check for normality of residuals). This means that one observation of the error term (residual) should not predict the next one. We want to observe randomness instead of patterns, cycles, and so on.

In [ ]:
fig = plt.figure(figsize=(12,5))
sns.lineplot(x=residuals.index, y=residuals, linewidth=1)
plt.xlabel('Index of Residual')
plt.ylabel('Residuals (Winnings)')
plt.yticks([-300000, -200000, -100000, 0, 100000, 200000, 300000],['-300K', '-200K', '-100K', '0', '100K', '200K', '300K'])
plt.title('Autocorrelation in Residuals', fontsize=12, weight='bold')

plt.show()

This assumption is also met.
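
As a quick numerical companion to the plot (optional, not in the original flow), the Durbin-Watson statistic summarises autocorrelation in the residuals; values near 2 indicate little autocorrelation, while values near 0 or 4 indicate strong positive or negative autocorrelation:

In [ ]:
from statsmodels.stats.stattools import durbin_watson

print(f'Durbin-Watson statistic: {durbin_watson(residuals):.3f}')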

Step 3 - Diagnosing and Correcting the Model¶

Now that we know where the model is failing, we will go back and correct these problems one by one, starting with the most glaring error: the level of multicollinearity. Once we fix this, we will re-evaluate whether the new model complies with our assumptions. We'll re-create the model, omitting the variables that had the highest VIF values: ReturnGamesPlayed and ServiceGamesPlayed.

In [ ]:
X = tennis_data[['Aces', 'BreakPointsFaced', 'BreakPointsOpportunities', 'DoubleFaults']]
y = tennis_data['Winnings']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state=100)

lm.fit(X_train, y_train);

vif_data = pd.DataFrame({'Feature': X_test.columns, 'VIF': [variance_inflation_factor(X_test.values, i) for i in range(len(X_test.columns))]})
vif_data
Out[ ]:
Feature VIF
0 Aces 4.030206
1 BreakPointsFaced 27.177893
2 BreakPointsOpportunities 19.802520
3 DoubleFaults 14.101704

That lowered the VIF values considerably, but most of our variables are still above the acceptable threshold. Let's remove the next set of evidently collinear variables: BreakPointsFaced and BreakPointsOpportunities.

In [ ]:
X = tennis_data[['Aces', 'DoubleFaults']]
y = tennis_data['Winnings']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state=100)

lm.fit(X_train, y_train);

vif_data = pd.DataFrame({'Feature': X_train.columns, 'VIF': [variance_inflation_factor(X_train.values, i) for i in range(len(X_train.columns))]})
vif_data
Out[ ]:
Feature VIF
0 Aces 4.543423
1 DoubleFaults 4.543423

This is much better. We can be happy with that and move forward with checking if the model passes the rest of the tests. This time, without the commentary.

In [ ]:
y_pred = lm.predict(X_test)
residuals = y_test - y_pred
In [ ]:
fig, axes = plt.subplots(2,3, figsize=(15, 10))
fig.subplots_adjust(hspace=0.3, wspace=0.22)
sns.set(font_scale=0.8)

#Distribution of residuals
sns.histplot(residuals, ax=axes[0,0]).set_title('Distribution of Residuals', fontsize=12, weight='bold')
sm.qqplot(residuals, line='r', ax=axes[0,1], markersize=4)
axes[0,1].set_title('QQ Plot', fontsize=12, weight='bold')

#Homoscedasticity
sns.scatterplot(x=y_test, y=residuals, ax=axes[0,2], alpha=0.4).set_title('Homoscedasticity', fontsize=12, weight='bold')
axes[0,2].set_ylabel('Residuals')
sns.lineplot(x=y_test, y=[0]*len(y_test), ax=axes[0,2], color='r');

#Correlation between independent variables and residuals
sns.scatterplot(x=X_test['Aces'], y=residuals, ax=axes[1,0], alpha=0.4).set_title('Relationship between \nAces and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['Aces'], y=[0]*len(X_test['Aces']), ax=axes[1,0], color='r');
sns.scatterplot(x=X_test['DoubleFaults'], y=residuals, ax=axes[1,1], alpha=0.4).set_title('Relationship between \nDoubleFaults and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['DoubleFaults'], y=[0]*len(X_test['DoubleFaults']), ax=axes[1,1], color='r');

#Autocorrelation in residuals
sns.lineplot(x=residuals.index, y=residuals, ax=axes[1,2]).set_title('Autocorrelation of Residuals', fontsize=12, weight='bold');
axes[1,2].set_ylabel('Residuals')

subplts = [axes[0,1], axes[0,2], axes[1,0], axes[1,1], axes[1,2]]
colnames = X_test.columns

for i in range(len(subplts)):
    subplts[i].set_yticks([-400000, -200000, 0, 200000, 400000],['-400K', '-200K', '0', '200K', '400K'])

Sometimes fixing multicollinearity fixes other issues. This time it did not.

Our next clue for fixing the model is the distribution of the residuals in the histogram and the QQ plot. It appears as if the most extreme values in the residuals throw the model off. Recall from the boxplots in our initial EDA that there are some extreme values in both Aces and DoubleFaults.

The problem is that Scikit Learn's LinearRegression does not have many built-in functions to diagnose the influence of outliers or to produce summary tables. We can use statsmodels for this, although we have to re-create our linear regression model.

In [ ]:
X_model = sm.add_constant(X_train)
sm_model = sm.OLS(y_train, X_model).fit()
sm_model.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: Winnings R-squared: 0.813
Model: OLS Adj. R-squared: 0.812
Method: Least Squares F-statistic: 659.2
Date: Tue, 11 Oct 2022 Prob (F-statistic): 4.34e-111
Time: 18:07:23 Log-Likelihood: -3963.6
No. Observations: 306 AIC: 7933.
Df Residuals: 303 BIC: 7944.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 4.058e+04 6876.396 5.901 0.000 2.7e+04 5.41e+04
Aces 536.2989 83.237 6.443 0.000 372.504 700.094
DoubleFaults 3173.6079 243.427 13.037 0.000 2694.586 3652.630
Omnibus: 143.150 Durbin-Watson: 1.977
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1056.680
Skew: 1.763 Prob(JB): 3.51e-230
Kurtosis: 11.393 Cond. No. 185.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can check that the statsmodels model corresponds to the one we created with SciKit Learn by comparing the coefficients and intercept (const) reported by statsmodels with those from our previous model. If they match, we can confidently use statsmodels to diagnose what we built with SciKit Learn:

In [ ]:
print(
    'Aces coefficient:\t\t', lm.coef_[0], '\n'
    'DoubleFaults coefficient:\t', lm.coef_[1], '\n'
    'Intercept:\t\t\t', lm.intercept_
)
Aces coefficient:		 536.2988660480429 
DoubleFaults coefficient:	 3173.6078643403675 
Intercept:			 40577.862306813855

A popular way to evaluate the influence of each data point on a linear regression model (including likely outliers) is Cook's Distance, which measures the effect on the model of removing that observation. The higher the Cook's Distance, the higher the influence (leverage) of the data point. Importantly, the size of a residual is not necessarily correlated with the size of its influence!

In [ ]:
cooks_dists = sm_model.get_influence().cooks_distance

fig, axes = plt.subplots(1,2, figsize=(10,5))
sns.scatterplot(x=X_model['Aces'], y=cooks_dists[0], alpha=0.4, ax=axes[0]).set_title('Cook\'s Distance for Aces', fontsize=12, weight='bold');
sns.scatterplot(x=X_model['DoubleFaults'], y=cooks_dists[0], alpha=0.4, ax=axes[1]).set_title('Cook\'s Distance for DoubleFaults', fontsize=12, weight='bold');
axes[0].set_ylabel('Distance');
axes[1].set_ylabel('Distance');
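
A common (though not universal) rule of thumb flags observations whose Cook's Distance exceeds 4/n. A minimal sketch of that check, reusing cooks_dists and X_train from above:

In [ ]:
# Flag observations with Cook's Distance above the conventional 4/n cutoff
cooks_cutoff = 4 / len(X_train)
cooks_series = pd.Series(cooks_dists[0], index=X_train.index)
cooks_series[cooks_series > cooks_cutoff].sort_values(ascending=False).head(10)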

There are a couple of ways to check leverage. The first is to plot the values of the hat matrix diagonal against the studentised residuals (residuals divided by an estimate of their standard deviation). An observation with an absolute studentised residual above 3 is usually considered abnormally high, but some would argue this is not stringent enough.

In [ ]:
hat_diag = sm_model.get_influence().hat_matrix_diag
studentised_residuals = sm_model.get_influence().resid_studentized_external

fig = plt.figure(figsize=(5,5))
sns.regplot(x=hat_diag, y=studentised_residuals, fit_reg=False, scatter_kws={'alpha':0.4, 's':15}).set_title('Leverage v Studentised Residuals', fontsize=12, weight='bold')
plt.plot(hat_diag, [3]*len(hat_diag), c='r', alpha=0.4)
plt.plot(hat_diag, [-3]*len(hat_diag), c='r', alpha=0.4)
plt.ylabel('Studentised Residuals', fontsize=10)
plt.xlabel('Leverage', fontsize=10);

We can check these results in a tabular fashion:

In [ ]:
studentised_residuals = sm_model.get_influence().resid_studentized_external

df_st_resid = pd.DataFrame({'Index (Obs)': X_train.index, 'Stud Resid': studentised_residuals})

#   IMPORTANT: If you don't specify the index from the training set as a column (Index (Obs)),
#   the index displayed in the results will correspond to the index of the `.resid_studentized_external` object, and NOT
#   to the index of the original observation that was used to build the model. You will then wonder, for example,
#   why the results show an index that doesn't reflect the original dataset, or why some plots show an index that doesn't correspond
#   to the size of the dataset used to create the model (see the influence plot a couple of paragraphs below).

outliers = df_st_resid[abs(df_st_resid['Stud Resid']) > 3]
outliers.sort_values(by='Index (Obs)')
Out[ ]:
Index (Obs) Stud Resid
206 107 -4.821700
281 123 3.664859
250 313 4.512081
35 325 5.186609
107 555 5.671448
262 559 4.218032
10 892 -3.409213
88 982 3.000684
235 1103 4.171207

Overall, having only 9 problematic points (for now) in a training set of 306 observations is remarkably good, although -- again -- some would argue that a cut-off of 3 is not stringent enough. This can be fine-tuned later if required.

The second way we can check for leverage is applying the leverage cut-off point given by the formula: $$\frac{2k + 2}{n}$$ where k is the number of predictors and n is the number of observations in the model. We can then use this to identify the points that have too much leverage:

In [ ]:
leverage_cutoff = ((2*len(lm.coef_))+2)/len(X_train) #about 0.0196 for this model

df_influence = pd.DataFrame({'Index (Obs)': X_train.index, 'Influence': hat_diag}) 
#   IMPORTANT: Again, if you don't specify the index from the training set as a column (Index Obs), 
#   the index displayed in the results will correspond to the index of the hat matrix, and NOT to the index of the original
#   observation that was used to build the model.

high_influence = df_influence[(df_influence['Influence'] > leverage_cutoff)]
high_influence.sort_values(by='Influence', ascending=False)
Out[ ]:
Index (Obs) Influence
206 107 0.404833
55 49 0.153951
90 1573 0.129802
222 336 0.110273
58 63 0.102671
18 561 0.093042
62 705 0.077757
19 927 0.055980
46 725 0.054473
24 221 0.052040
51 361 0.038056
215 430 0.036055
187 732 0.034388
195 562 0.033957
235 1103 0.032844
137 101 0.032379
139 1005 0.029607
122 531 0.028054
40 612 0.027748
303 924 0.024581
37 388 0.023643
250 313 0.023528
79 554 0.021527

This time 23 observations were flagged, and 3 are repeats from the previous method: Indices 107, 313, and 1103.

So now we are wondering: can we choose one method over the other to determine influence? The answer is no. Since each method focuses on a different aspect of the observations, they diagnose different things about the points, as you can see from the difference between the last two sets of results. Let me explain.

We applied the studentised residual cut-off threshold of 3 over the "Studentised Residuals" axis of the data in order to determine outliers. The leverage cut-off threshold, on the other hand, is applied over the "Hat Matrix Diagonal" axis of the data. Look again at the "Leverage v Studentised Residuals" graph we created a few cells ago and you can see why this makes sense: having a high leverage value does not necessarily go hand in hand with having a high studentised residual value.

It might be argued that being an outlier automatically qualifies a point as influential, but remember that we are applying two different criteria over two different dimensions of the data. Additionally, although outliers commonly have high influence (hence the argument), this is not necessarily the case for every dataset. It is always good practice to check both the outliers and the leverage of each point to see whether the points with the largest influence are also the ones flagged as outliers. This makes more sense if we plot an additional cut-off line on our graph, so we can see the "quadrants" where outliers have high or low leverage, and which points have high leverage without being outliers.

In [ ]:
fig = plt.figure(figsize=(5,5))
sns.regplot(x=hat_diag, y=studentised_residuals, fit_reg=False, scatter_kws={'alpha':0.4, 's':15}).set_title('Leverage v Studentised Residuals', fontsize=12, weight='bold')
plt.plot(hat_diag, [3]*len(hat_diag), c='r', alpha=0.4)
plt.plot(hat_diag, [-3]*len(hat_diag), c='r', alpha=0.4)
plt.vlines(leverage_cutoff, ymin=-5, ymax=6, color='r', alpha=0.4)
plt.ylabel('Studentised Residuals', fontsize=10)
plt.xlabel('Leverage', fontsize=10);

Now that we have seen and understood all of this, we can visualise influence with statsmodels' influence plot, which conveniently labels some of the most influential points with the index of the original observation.

In [ ]:
fig, ax = plt.subplots(figsize=(8,8));
sm.graphics.influence_plot(sm_model, criterion='cooks', ax=ax, alpha=0.3, );

We can see that, effectively, observation 107 is by far the most influential.

Now, it is tempting to just go and remove every data point that doesn't comply with our stipulations because that's what we've been taught to do, but it is important to remember that outliers are not always simply "wrong" data points resulting from typos or machine malfunction: they can be real observations of the world and thus convey important information about the reality we are trying to model. This is especially true in human endeavours such as sports, earnings, hours worked in a week, and more, where some people score points, earn money, or work hours far beyond the vast majority of the population. This also calls into question whether certain datasets can be described by the chosen model (linear, logistic, etc.), and how accurate that description is.

Deleting these outliers effectively removes information from the dataset, and the model will therefore not take that information into consideration, for better or for worse. It is important to pause and think about what we want to model, and to disclose any caveats incurred when creating the model, so that whoever uses it understands its limitations (for example, the range of income or hours worked over which the model is valid). In some cases the extreme observations are exactly what we want to predict, in which case we should not remove them from the dataset.

Step 4 - Removing Influential Points to Improve the Model¶

What someone thinks is worth removing might be worth keeping for someone else, and the trade-offs between model accuracy and the data represented by the model are for the modeller to consider. So the following will be an evaluation of the model without the points we have so far identified as outliers or influential, which might not necessarily be what's ultimately best for the model.

In [ ]:
#Getting the indices to be removed
outliers_indices = outliers['Index (Obs)'].to_list()
high_influence_indices = high_influence['Index (Obs)'].to_list()
to_remove = list(set(outliers_indices + high_influence_indices))

#Removing these observations
for i in to_remove:
    X_train = X_train.drop(i)
    y_train = y_train.drop(i)
In [ ]:
lm.fit(X_train, y_train);

y_pred = lm.predict(X_test)
residuals = y_test - y_pred
In [ ]:
X_model = sm.add_constant(X_train)
sm_model = sm.OLS(y_train, X_model).fit()
sm_model.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: Winnings R-squared: 0.820
Model: OLS Adj. R-squared: 0.819
Method: Least Squares F-statistic: 623.9
Date: Tue, 11 Oct 2022 Prob (F-statistic): 9.78e-103
Time: 18:07:24 Log-Likelihood: -3442.8
No. Observations: 277 AIC: 6892.
Df Residuals: 274 BIC: 6902.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 2.735e+04 4470.318 6.118 0.000 1.86e+04 3.62e+04
Aces 1007.3281 149.297 6.747 0.000 713.413 1301.243
DoubleFaults 2738.2429 378.121 7.242 0.000 1993.851 3482.635
Omnibus: 39.524 Durbin-Watson: 1.920
Prob(Omnibus): 0.000 Jarque-Bera (JB): 208.644
Skew: 0.373 Prob(JB): 4.94e-46
Kurtosis: 7.186 Cond. No. 94.0


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [ ]:
fig, axes = plt.subplots(2,3, figsize=(15, 10))
fig.subplots_adjust(hspace=0.3, wspace=0.22)
sns.set(font_scale=0.8)

#Distribution of residuals
sns.histplot(residuals, ax=axes[0,0]).set_title('Distribution of Residuals', fontsize=12, weight='bold')
sm.qqplot(residuals, line='r', ax=axes[0,1], markersize=4)
axes[0,1].set_title('QQ Plot', fontsize=12, weight='bold')

#Homoscedasticity
sns.scatterplot(x=y_test, y=residuals, ax=axes[0,2], alpha=0.4).set_title('Homoscedasticity', fontsize=12, weight='bold')
axes[0,2].set_ylabel('Residuals')
sns.lineplot(x=y_test, y=[0]*len(y_test), ax=axes[0,2], color='r');

#Correlation between independent variables and residuals
sns.scatterplot(x=X_test['Aces'], y=residuals, ax=axes[1,0], alpha=0.4).set_title('Relationship between \nAces and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['Aces'], y=[0]*len(X_test['Aces']), ax=axes[1,0], color='r');
sns.scatterplot(x=X_test['DoubleFaults'], y=residuals, ax=axes[1,1], alpha=0.4).set_title('Relationship between \nDoubleFaults and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['DoubleFaults'], y=[0]*len(X_test['DoubleFaults']), ax=axes[1,1], color='r');

#Autocorrelation in residuals
sns.lineplot(x=residuals.index, y=residuals, ax=axes[1,2]).set_title('Autocorrelation of Residuals', fontsize=12, weight='bold');
axes[1,2].set_ylabel('Residuals')

subplts = [axes[0,1], axes[0,2], axes[1,0], axes[1,1], axes[1,2]]
colnames = X_test.columns

for i in range(len(subplts)):
    subplts[i].set_yticks([-400000, -200000, 0, 200000, 400000],['-400K', '-200K', '0', '200K', '400K'])

The model's R2 did not improve by much, and the model still does not do too well with the unknown data, likely because the unknown data (X_test) also contains outliers and highly influential points (remember we only detected them in, and removed them from, the portion of the data we originally used to train the model):

In [ ]:
fig, axes = plt.subplots(1,2, figsize=(10,5))
sns.scatterplot(x=X_test['Aces'], y=y_test, alpha=0.4, ax=axes[0]).set_title('Aces and Winnings\nin Test Dataset', fontsize=12, weight='bold');
sns.scatterplot(x=X_test['DoubleFaults'], y=y_test, alpha=0.4, ax=axes[1]).set_title('DoubleFaults and Winnings\nin Test Dataset', fontsize=12, weight='bold');

The last iteration of this exercise will take a step back and try to fix the model by identifying and removing the outliers and highly influential points from the whole dataset. This time the outlier detection cut-off will be more aggressive (2 instead of 3).

In [ ]:
X_model = sm.add_constant(X)
sm_model = sm.OLS(y, X_model).fit()
In [ ]:
hat_diag = sm_model.get_influence().hat_matrix_diag
studentised_residuals = sm_model.get_influence().resid_studentized_external

df_st_resid = pd.DataFrame({'Index (Obs)': X.index, 'Stud Resid': studentised_residuals})

outliers = df_st_resid[abs(df_st_resid['Stud Resid']) > 2]
outliers.sort_values(by='Index (Obs)')
Out[ ]:
Index (Obs) Stud Resid
44 56 2.739897
52 71 2.768442
74 101 2.275342
76 107 -4.931413
86 123 3.636454
132 221 -2.549095
160 288 2.098548
161 289 4.115735
166 308 5.263617
167 313 4.584519
171 325 5.115658
205 403 -4.355229
216 432 2.662136
228 483 2.173804
248 555 5.619228
251 559 4.238278
252 561 -2.332369
269 624 2.929817
281 672 -4.376502
284 696 2.852104
295 732 2.419758
296 735 -2.129421
320 892 -3.360578
331 982 3.059259
344 1076 2.993149
347 1103 4.270050
403 1473 -2.579514
In [ ]:
leverage_cutoff = ((2*len(lm.coef_))+2)/len(X) #about 0.0137 this time

df_influence = pd.DataFrame({'Index (Obs)': X.index, 'Influence': hat_diag}) 

high_influence = df_influence[(df_influence['Influence'] > leverage_cutoff)]
high_influence.sort_values(by='Influence', ascending=False)
Out[ ]:
Index (Obs) Influence
76 107 0.327400
281 672 0.144068
40 49 0.127635
176 336 0.089514
415 1573 0.081889
285 705 0.065956
48 63 0.062699
252 561 0.057447
284 696 0.048018
342 1068 0.044198
324 927 0.034569
132 221 0.034551
293 725 0.034326
158 284 0.032214
295 732 0.028805
403 1473 0.028789
253 562 0.026947
166 308 0.026098
74 101 0.026080
265 612 0.023696
188 361 0.023148
214 430 0.022738
205 403 0.021963
333 1005 0.021381
296 735 0.020666
347 1103 0.020288
64 85 0.019682
323 924 0.018797
242 531 0.018779
136 231 0.018441
247 554 0.018314
199 388 0.016137
370 1222 0.015783
304 793 0.015362
167 313 0.014513
In [ ]:
#Data
X = tennis_data[['Aces', 'DoubleFaults']]
y = tennis_data['Winnings']

#Getting the indices to be removed
outliers_indices = outliers['Index (Obs)'].to_list()
high_influence_indices = high_influence['Index (Obs)'].to_list()
to_remove = list(set(outliers_indices + high_influence_indices))

#Removing these observations
for i in to_remove:
    X = X.drop(i)
    y = y.drop(i)

#Split data, train, and test

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state=100)

lm.fit(X_train, y_train)

y_pred = lm.predict(X_test)
residuals = y_test - y_pred
In [ ]:
#Visualise and Diagnose

fig, axes = plt.subplots(2,3, figsize=(15, 10))
fig.subplots_adjust(hspace=0.3, wspace=0.22)
sns.set(font_scale=0.8)

#Distribution of residuals
sns.histplot(residuals, ax=axes[0,0]).set_title('Distribution of Residuals', fontsize=12, weight='bold')
axes[0,0].set_xticks([-150000, -100000, -50000, 0, 50000, 100000, 150000],['-150K', '-100K', '-50K', '0', '50K', '100K', '150K'])
sm.qqplot(residuals, line='r', ax=axes[0,1], markersize=4)
axes[0,1].set_title('QQ Plot', fontsize=12, weight='bold')

#Homoscedasticity
sns.scatterplot(x=y_test, y=residuals, ax=axes[0,2], alpha=0.4).set_title('Homoscedasticity', fontsize=12, weight='bold')
axes[0,2].set_ylabel('Residuals')
sns.lineplot(x=y_test, y=[0]*len(y_test), ax=axes[0,2], color='r');

#Correlation between independent variables and residuals
sns.scatterplot(x=X_test['Aces'], y=residuals, ax=axes[1,0], alpha=0.4).set_title('Relationship between \nAces and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['Aces'], y=[0]*len(X_test['Aces']), ax=axes[1,0], color='r');
sns.scatterplot(x=X_test['DoubleFaults'], y=residuals, ax=axes[1,1], alpha=0.4).set_title('Relationship between \nDoubleFaults and Residuals (Winnings)', fontsize=12, weight='bold')
sns.lineplot(x=X_test['DoubleFaults'], y=[0]*len(X_test['DoubleFaults']), ax=axes[1,1], color='r');

#Autocorrelation in residuals
sns.lineplot(x=residuals.index, y=residuals, ax=axes[1,2]).set_title('Autocorrelation of Residuals', fontsize=12, weight='bold');
axes[1,2].set_ylabel('Residuals')

subplts = [axes[0,1], axes[0,2], axes[1,0], axes[1,1], axes[1,2]]
colnames = X_test.columns

for i in range(len(subplts)):
    subplts[i].set_yticks([-400000, -200000, 0, 200000, 400000],['-400K', '-200K', '0', '200K', '400K'])
In [ ]:
#Diagnose
X_model = sm.add_constant(X_train)
sm_model = sm.OLS(y_train, X_model).fit()
sm_model.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: Winnings R-squared: 0.823
Model: OLS Adj. R-squared: 0.821
Method: Least Squares F-statistic: 623.4
Date: Tue, 11 Oct 2022 Prob (F-statistic): 1.01e-101
Time: 18:07:25 Log-Likelihood: -3319.1
No. Observations: 272 AIC: 6644.
Df Residuals: 269 BIC: 6655.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 2.669e+04 3551.116 7.515 0.000 1.97e+04 3.37e+04
Aces 492.2228 122.894 4.005 0.000 250.267 734.179
DoubleFaults 3421.4952 291.065 11.755 0.000 2848.440 3994.551
Omnibus: 18.608 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 51.668
Skew: 0.188 Prob(JB): 6.03e-12
Kurtosis: 5.102 Cond. No. 80.7


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [ ]:
fig = plt.figure(figsize=(5,5))
sns.regplot(x=y_test, y=y_pred, scatter_kws={'alpha':0.4, 's':15}).set_title('Actual v Predicted Winnings', fontsize=12, weight='bold')
plt.ylabel('Predicted Winnings', fontsize=10)
plt.xlabel('Actual Winnings', fontsize=10);

The distribution of the residuals improved greatly, however we still have heteroscedasticity and other issues.
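
Note that the R2 reported by statsmodels above is computed on the training data. As an optional sanity check, we can also score the model on the held-out test set (using the lm, X_test, y_test and y_pred objects from above):

In [ ]:
from sklearn.metrics import r2_score

print(f'Test-set R^2 (r2_score): {r2_score(y_test, y_pred):.3f}')
print(f'Test-set R^2 (lm.score): {lm.score(X_test, y_test):.3f}')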

We could keep iterating to fix this model by removing data, increasing our score (R2 value) and trying to address the other issues, but we would be removing more and more data just to make the model work better. With my current level of stringency, I have already removed 49 out of 438 observations (more than 10%!). What if this is as good as the model will get with this dataset? We can't just eliminate half of the dataset to get a perfect score. Alternatively, we could go back and delete only the outliers, but not the highly influential points. There are many options we could explore, but this exercise is already long as it is.

An R2 value of 0.82 can be considered very good considering that:

  • Real-world data is messy and will contain outliers and other problematic observations, which are not always erroneous, and
  • Real-world data is hardly ever linear, unless it has been collected within a very precise framework under tightly controlled circumstances. Data about humans in sports? You get the idea.

Additionally, it is important to remember that we have chosen (for the sake of the exercise) to fit a linear regression model. It is possible that other models might fit the data better, but exploring those options escapes the scope of this exercise.
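
As a pointer for that future work (just a sketch, not explored here), scikit-learn includes robust regressors such as HuberRegressor, which down-weight extreme observations during fitting instead of requiring us to delete them:

In [ ]:
from sklearn.linear_model import HuberRegressor

# Robust alternative: outliers are down-weighted rather than removed
huber = HuberRegressor(max_iter=1000).fit(X_train, y_train)
print(f'Huber test-set R^2: {huber.score(X_test, y_test):.3f}')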

So far, we can be happy with this model. What do you think?

Thanks for reading!