Creating and Evaluating a Model to Detect Fraud Using Synthetic Payment Data¶

Overview¶

This is a short project where I use synthetic payment data created by Edgar Lopez-Rojas (available on Kaggle) to build and evaluate first a Logistic Regression model -- and then a support vector machine to address the LR model's limitations -- to detect likely financial fraud based on several features of the data. Please note: I am not a financial expert; the commentary here is based on my limited knowledge of day-to-day finances and is for display purposes only. The analysis presented here does not constitute financial advice of any kind.

IMPORTANT: Rather than showing you how to build the perfect LR system using standard machine learning approaches, this short project is intended to illustrate the complexities of using real-world data to detect something as uncommon as financial fraud, and why special attention is required to measure and evaluate the actual performance of the model.

If you find something that you think is wrong or could be done better or simpler, feel free to contact me!


The data comes with the following info and description:

[This is] a synthetic dataset generated using the simulator called Paysim [that] uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company [...].

Headers:

-step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).

-type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

-amount - amount of the transaction in local currency.

-nameOrig - customer who started the transaction

-oldbalanceOrg - initial balance before the transaction

-newbalanceOrig - new balance after the transaction

-nameDest - customer who is the recipient of the transaction

-oldbalanceDest - initial balance recipient before the transaction. Note that there is not information for customers that start with M (Merchants).

-newbalanceDest - new balance recipient after the transaction. Note that there is not information for customers that start with M (Merchants).

-isFraud - This is the transactions made by the fraudulent agents inside the simulation. In this specific dataset the fraudulent behavior of the agents aims to profit by taking control or customers accounts and try to empty the funds by transferring to another account and then cashing out of the system.

-isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.
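The headers above translate into an explicit column schema. As a sketch (not part of the original pipeline, and assuming the `paysim_data.csv` filename used later in this notebook), passing a dtype mapping to `read_csv` makes the schema explicit and saves memory on a 6M+ row file:

```python
import pandas as pd

# Explicit dtypes inferred from the header descriptions above; 'category'
# for the small fixed set of transaction types saves memory on 6M+ rows.
paysim_dtypes = {
    'step': 'int32',
    'type': 'category',
    'amount': 'float64',
    'nameOrig': 'object',
    'oldbalanceOrg': 'float64',
    'newbalanceOrig': 'float64',
    'nameDest': 'object',
    'oldbalanceDest': 'float64',
    'newbalanceDest': 'float64',
    'isFraud': 'int8',
    'isFlaggedFraud': 'int8',
}

# Hypothetical load call (the notebook below uses a plain read_csv):
# pay_data = pd.read_csv('paysim_data.csv', dtype=paysim_dtypes)
```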


The pipeline uses seaborn, pandas, numpy, matplotlib and SciKit Learn.

Step 1 - Initial Exploratory Data Analysis.¶

In [ ]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
In [ ]:
# Load the data
pay_data = pd.read_csv('paysim_data.csv')

First, since we want to build a Logistic Regression model, which relies on observations being independent, let's check whether we have more than one observation coming from the same initial customer:

In [ ]:
pay_data['nameOrig'].value_counts()
Out[ ]:
C1902386530    3
C363736674     3
C545315117     3
C724452879     3
C1784010646    3
              ..
C98968405      1
C720209255     1
C1567523029    1
C644777639     1
C1280323807    1
Name: nameOrig, Length: 6353307, dtype: int64

And since some do, we'll start by dropping all "duplicate" transactions (i.e. transactions that come from the same initial customer):

In [ ]:
pay_data = pay_data.drop_duplicates(subset='nameOrig')

Additionally, we'll check whether any observation has amount == 0, perhaps by mistake, since we are not interested in these observations:

In [ ]:
pay_data[pay_data['amount'] == 0]
Out[ ]:
step type amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud isFlaggedFraud
2736447 212 CASH_OUT 0.0 C1510987794 0.0 0.0 C1696624817 0.00 0.00 1 0
3247298 250 CASH_OUT 0.0 C521393327 0.0 0.0 C480398193 0.00 0.00 1 0
3760289 279 CASH_OUT 0.0 C539112012 0.0 0.0 C1106468520 538547.63 538547.63 1 0
5563714 387 CASH_OUT 0.0 C1294472700 0.0 0.0 C1325541393 7970766.57 7970766.57 1 0
5996408 425 CASH_OUT 0.0 C832555372 0.0 0.0 C1462759334 76759.90 76759.90 1 0
5996410 425 CASH_OUT 0.0 C69493310 0.0 0.0 C719711728 2921531.34 2921531.34 1 0
6168500 554 CASH_OUT 0.0 C10965156 0.0 0.0 C1493336195 230289.66 230289.66 1 0
6205440 586 CASH_OUT 0.0 C1303719003 0.0 0.0 C900608348 1328472.86 1328472.86 1 0
6266414 617 CASH_OUT 0.0 C1971175979 0.0 0.0 C1352345416 0.00 0.00 1 0
6281483 646 CASH_OUT 0.0 C2060908932 0.0 0.0 C1587892888 0.00 0.00 1 0
6281485 646 CASH_OUT 0.0 C1997645312 0.0 0.0 C601248796 0.00 0.00 1 0
6296015 671 CASH_OUT 0.0 C1960007029 0.0 0.0 C459118517 27938.72 27938.72 1 0
6351226 702 CASH_OUT 0.0 C1461113533 0.0 0.0 C1382150537 107777.02 107777.02 1 0
6362461 730 CASH_OUT 0.0 C729003789 0.0 0.0 C1388096959 1008609.53 1008609.53 1 0
6362463 730 CASH_OUT 0.0 C2088151490 0.0 0.0 C1156763710 0.00 0.00 1 0
6362585 741 CASH_OUT 0.0 C312737633 0.0 0.0 C1400061387 267522.87 267522.87 1 0

And since we do, we'll drop them next. Interestingly, all of these observations are labeled as fraud (isFraud == 1) even though the amount transferred is 0, which seems inconsistent with the definition of this variable in the introduction.

In [ ]:
pay_data = pay_data[(pay_data['amount'] > 0)]

And now let's explore the data!

In [ ]:
#Explore the data
pay_data.head(15)
Out[ ]:
step type amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud isFlaggedFraud
0 1 PAYMENT 9839.64 C1231006815 170136.00 160296.36 M1979787155 0.0 0.00 0 0
1 1 PAYMENT 1864.28 C1666544295 21249.00 19384.72 M2044282225 0.0 0.00 0 0
2 1 TRANSFER 181.00 C1305486145 181.00 0.00 C553264065 0.0 0.00 1 0
3 1 CASH_OUT 181.00 C840083671 181.00 0.00 C38997010 21182.0 0.00 1 0
4 1 PAYMENT 11668.14 C2048537720 41554.00 29885.86 M1230701703 0.0 0.00 0 0
5 1 PAYMENT 7817.71 C90045638 53860.00 46042.29 M573487274 0.0 0.00 0 0
6 1 PAYMENT 7107.77 C154988899 183195.00 176087.23 M408069119 0.0 0.00 0 0
7 1 PAYMENT 7861.64 C1912850431 176087.23 168225.59 M633326333 0.0 0.00 0 0
8 1 PAYMENT 4024.36 C1265012928 2671.00 0.00 M1176932104 0.0 0.00 0 0
9 1 DEBIT 5337.77 C712410124 41720.00 36382.23 C195600860 41898.0 40348.79 0 0
10 1 DEBIT 9644.94 C1900366749 4465.00 0.00 C997608398 10845.0 157982.12 0 0
11 1 PAYMENT 3099.97 C249177573 20771.00 17671.03 M2096539129 0.0 0.00 0 0
12 1 PAYMENT 2560.74 C1648232591 5070.00 2509.26 M972865270 0.0 0.00 0 0
13 1 PAYMENT 11633.76 C1716932897 10127.00 0.00 M801569151 0.0 0.00 0 0
14 1 PAYMENT 4098.78 C1026483832 503264.00 499165.22 M1635378213 0.0 0.00 0 0
In [ ]:
print('\nLength of Dataset: ', len(pay_data))
print('\nColumns: ', pay_data.columns)
Length of Dataset:  6353291

Columns:  Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

The dataset is over 6M rows! It comes with an important column (isFraud), which we will use later to train our model. Also, there is a slight discrepancy between the column names oldbalanceOrg and newbalanceOrig, so to make them consistent (and easier to remember), we'll rename oldbalanceOrg to oldbalanceOrig.

In [ ]:
#Pandas Index objects are immutable, so we can't assign to a single element; rename() handles this cleanly:
pay_data = pay_data.rename(columns={'oldbalanceOrg': 'oldbalanceOrig'})
pay_data.columns
Out[ ]:
Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrig',
       'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest',
       'isFraud', 'isFlaggedFraud'],
      dtype='object')

Now we have to do some QC'ing and exploration before we can move forward: check whether we have missing values, typos, or incoherent data types (e.g. dtype object where it should be int or float), and look at the range and distribution of the data in each column, outliers, etcetera.

In [ ]:
pay_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6353291 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrig  float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 581.7+ MB
In [ ]:
pay_data.isna().sum()
Out[ ]:
step              0
type              0
amount            0
nameOrig          0
oldbalanceOrig    0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64
In [ ]:
pay_data.describe()
Out[ ]:
step amount oldbalanceOrig newbalanceOrig oldbalanceDest newbalanceDest isFraud isFlaggedFraud
count 6.353291e+06 6.353291e+06 6.353291e+06 6.353291e+06 6.353291e+06 6.353291e+06 6.353291e+06 6.353291e+06
mean 2.432836e+02 1.798499e+05 8.339021e+05 8.551329e+05 1.100557e+06 1.224843e+06 1.287364e-03 2.518380e-06
std 1.423257e+02 6.038027e+05 2.888229e+06 2.924031e+06 3.398649e+06 3.673608e+06 3.585676e-02 1.586939e-03
min 1.000000e+00 1.000000e-02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 1.550000e+02 1.338926e+04 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
50% 2.390000e+02 7.487491e+04 1.421000e+04 0.000000e+00 1.327156e+05 2.146720e+05 0.000000e+00 0.000000e+00
75% 3.340000e+02 2.087207e+05 1.073220e+05 1.442728e+05 9.429916e+05 1.111917e+06 0.000000e+00 0.000000e+00
max 7.430000e+02 9.244552e+07 5.958504e+07 4.958504e+07 3.560159e+08 3.561793e+08 1.000000e+00 1.000000e+00

The data looks severely skewed (check the min, 25%, 50%, 75% and max rows for the columns oldbalanceOrig, newbalanceOrig, oldbalanceDest and newbalanceDest) and extremely variable (check std for the same columns). Let's inspect this visually. Be mindful that we'll use a log scale on some of the x axes of the histograms.

In [ ]:
fig, axs = plt.subplots(2,5, figsize=(20,10))

axs[0,0].hist(pay_data['amount'], bins=50, color = 'tab:blue')
axs[0,0].set_xscale('log')
axs[0,0].set_ylabel('Frequency', fontsize=14)
axs[0,0].set_title('amount', fontsize=14)
axs[1,0].boxplot(pay_data['amount'])
axs[1,0].set_ylabel('IQR + Outliers', fontsize=14)

axs[0,1].hist(pay_data['oldbalanceOrig'], bins=50, color = 'tab:orange')
axs[0,1].set_title('oldbalanceOrig', fontsize=14)
axs[1,1].boxplot(pay_data['oldbalanceOrig'])

axs[0,2].hist(pay_data['newbalanceOrig'], bins=50, color = 'tab:green')
axs[0,2].set_xscale('log')
axs[0,2].set_title('newbalanceOrig', fontsize=14)
axs[1,2].boxplot(pay_data['newbalanceOrig'])

axs[0,3].hist(pay_data['oldbalanceDest'], bins=50, color = 'tab:red')
axs[0,3].set_xscale('log')
axs[0,3].set_title('oldbalanceDest', fontsize=14)
axs[1,3].boxplot(pay_data['oldbalanceDest'])

axs[0,4].hist(pay_data['newbalanceDest'], bins=50, color = 'tab:purple')
axs[0,4].set_title('newbalanceDest', fontsize=14)
axs[1,4].boxplot(pay_data['newbalanceDest'])

plt.show()

Damn! The data is ridiculously skewed! But this is somewhat expected: remember from the description that oldbalanceDest and newbalanceDest carry no information for merchant customers (nameDest values that start with an 'M')? That probably explains some of the skewness in those variables. Why? If we think about what the data represents (from the Kaggle description of the data, see the Overview):

[This is] a synthetic dataset [...] that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company.

we can understand why the data is so heavily skewed towards smaller values: most people do not transfer such enormous amounts of money between accounts! The 'abnormalities' come from the other tail of the distribution, but we need to take time to decide what to do with them, because their exclusion (and to what degree) will affect the performance of the model. We need to explore the data in more detail before making decisions.
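One common option (not used in this notebook, just a sketch) for taming this kind of right skew before modelling is a log transform. On toy data mimicking the heavy tail of `amount`:

```python
import numpy as np
import pandas as pd

# Toy data mimicking the heavy right skew of `amount`: mostly small
# transfers plus one enormous one.
amounts = pd.Series([10.0, 50.0, 200.0, 1_000.0, 5_000.0, 9_000_000.0])

# log1p computes log(1 + x): monotone, compresses large values,
# and is safe at zero (though we already dropped amount == 0 above).
log_amounts = np.log1p(amounts)

# The sample skewness drops sharply after the transform.
print(amounts.skew(), log_amounts.skew())
```

Whether to apply such a transform depends on the model: tree-based methods are insensitive to it, while distance- and margin-based models can benefit.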

Let's start checking out information that involves merchants and customers separately.

In [ ]:
n_merchants_nameDest = len(pay_data['nameDest'][pay_data['nameDest'].str.contains('M')])
n_merchants_nameOrig = len(pay_data['nameOrig'][pay_data['nameOrig'].str.contains('M')])
perc_merchants_nameDest = round((n_merchants_nameDest/len(pay_data))*100,2)
perc_merchants_nameOrig = round((n_merchants_nameOrig/len(pay_data))*100,2)

print('Number of transaction where destination is a merchant: ', n_merchants_nameDest)
print('Number of transaction where origin is a merchant: ', n_merchants_nameOrig)
print('Percentage of transaction where destination is a merchant: ', perc_merchants_nameDest,'%')
print('Percentage of transaction where origin is a merchant: ', perc_merchants_nameOrig,'%')
Number of transaction where destination is a merchant:  2148333
Number of transaction where origin is a merchant:  0
Percentage of transaction where destination is a merchant:  33.81 %
Percentage of transaction where origin is a merchant:  0.0 %
In [ ]:
n_customers_nameDest = len(pay_data['nameDest'][pay_data['nameDest'].str.contains('C')])
n_customers_nameOrig = len(pay_data['nameOrig'][pay_data['nameOrig'].str.contains('C')])
perc_customers_nameDest = round((n_customers_nameDest/len(pay_data))*100,2)
perc_customers_nameOrig = round((n_customers_nameOrig/len(pay_data))*100,2)

print('Number of transaction where destination is a customer: ', n_customers_nameDest)
print('Number of transaction where origin is a customer: ', n_customers_nameOrig)
print('Percentage of transaction where destination is a customer: ', perc_customers_nameDest,'%')
print('Percentage of transaction where origin is a customer: ', perc_customers_nameOrig,'%')
Number of transaction where destination is a customer:  4204958
Number of transaction where origin is a customer:  6353291
Percentage of transaction where destination is a customer:  66.19 %
Percentage of transaction where origin is a customer:  100.0 %

Looks like merchants do not initiate transactions in this dataset, and they account for nearly 34% of the destinations of these transactions. Customers are the only ones initiating transactions, and they account for the remaining 66% of the destinations.

Let's check out for now the transactions that are done customer-customer and customer-merchant, since there are no transactions originated from merchants.

In [ ]:
customer_customer = pay_data[pay_data['nameOrig'].str.contains('C') & pay_data['nameDest'].str.contains('C')]
customer_customer.head()
Out[ ]:
step type amount nameOrig oldbalanceOrig newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud isFlaggedFraud
2 1 TRANSFER 181.00 C1305486145 181.0 0.00 C553264065 0.0 0.00 1 0
3 1 CASH_OUT 181.00 C840083671 181.0 0.00 C38997010 21182.0 0.00 1 0
9 1 DEBIT 5337.77 C712410124 41720.0 36382.23 C195600860 41898.0 40348.79 0 0
10 1 DEBIT 9644.94 C1900366749 4465.0 0.00 C997608398 10845.0 157982.12 0 0
15 1 CASH_OUT 229133.94 C905080434 15325.0 0.00 C476402209 5083.0 51513.44 0 0
In [ ]:
customer_customer['type'].unique()
Out[ ]:
array(['TRANSFER', 'CASH_OUT', 'DEBIT', 'CASH_IN'], dtype=object)
In [ ]:
customer_customer.describe()
Out[ ]:
step amount oldbalanceOrig newbalanceOrig oldbalanceDest newbalanceDest isFraud isFlaggedFraud
count 4.204958e+06 4.204958e+06 4.204958e+06 4.204958e+06 4.204958e+06 4.204958e+06 4.204958e+06 4.204958e+06
mean 2.427839e+02 2.650655e+05 1.225089e+06 1.260426e+06 1.662837e+06 1.850621e+06 1.945085e-03 3.805032e-06
std 1.421379e+02 7.275209e+05 3.482954e+06 3.523145e+06 4.064136e+06 4.385451e+06 4.406021e-02 1.950646e-03
min 1.000000e+00 1.000000e-02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 1.550000e+02 7.608690e+04 0.000000e+00 0.000000e+00 1.396720e+05 2.217264e+05 0.000000e+00 0.000000e+00
50% 2.370000e+02 1.589422e+05 1.815700e+04 0.000000e+00 5.512318e+05 6.837139e+05 0.000000e+00 0.000000e+00
75% 3.340000e+02 2.784826e+05 1.911609e+05 2.849697e+05 1.692878e+06 1.910872e+06 0.000000e+00 0.000000e+00
max 7.430000e+02 9.244552e+07 5.958504e+07 4.958504e+07 3.560159e+08 3.561793e+08 1.000000e+00 1.000000e+00
In [ ]:
customer_merchant = pay_data[pay_data['nameOrig'].str.contains('C') & pay_data['nameDest'].str.contains('M')]
customer_merchant.head()
Out[ ]:
step type amount nameOrig oldbalanceOrig newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud isFlaggedFraud
0 1 PAYMENT 9839.64 C1231006815 170136.0 160296.36 M1979787155 0.0 0.0 0 0
1 1 PAYMENT 1864.28 C1666544295 21249.0 19384.72 M2044282225 0.0 0.0 0 0
4 1 PAYMENT 11668.14 C2048537720 41554.0 29885.86 M1230701703 0.0 0.0 0 0
5 1 PAYMENT 7817.71 C90045638 53860.0 46042.29 M573487274 0.0 0.0 0 0
6 1 PAYMENT 7107.77 C154988899 183195.0 176087.23 M408069119 0.0 0.0 0 0
In [ ]:
customer_merchant['type'].unique()
Out[ ]:
array(['PAYMENT'], dtype=object)
In [ ]:
customer_merchant.describe()
Out[ ]:
step amount oldbalanceOrig newbalanceOrig oldbalanceDest newbalanceDest isFraud isFlaggedFraud
count 2.148333e+06 2.148333e+06 2.148333e+06 2.148333e+06 2148333.0 2148333.0 2148333.0 2148333.0
mean 2.442619e+02 1.305651e+04 6.822677e+04 6.184797e+04 0.0 0.0 0.0 0.0
std 1.426877e+02 1.255499e+04 1.990730e+05 1.970739e+05 0.0 0.0 0.0 0.0
min 1.000000e+00 2.000000e-02 0.000000e+00 0.000000e+00 0.0 0.0 0.0 0.0
25% 1.560000e+02 4.383570e+03 0.000000e+00 0.000000e+00 0.0 0.0 0.0 0.0
50% 2.490000e+02 9.481460e+03 1.053000e+04 0.000000e+00 0.0 0.0 0.0 0.0
75% 3.350000e+02 1.756005e+04 6.089100e+04 4.966411e+04 0.0 0.0 0.0 0.0
max 7.180000e+02 2.386380e+05 4.368662e+07 4.367380e+07 0.0 0.0 0.0 0.0

There are several interesting differences between customer-customer and customer-merchant transactions. The first is that customer-merchant transactions are only of the 'PAYMENT' type, while customer-customer transactions involve the 'TRANSFER', 'CASH_OUT', 'DEBIT' and 'CASH_IN' types. The distribution of the amount column is different as well: much higher amounts between customers than between a customer and a merchant.
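That split of transaction types by counterparty can be verified in a single cross-tabulation. A sketch on toy rows (on the real data you would pass the `pay_data` columns instead):

```python
import pandas as pd

# Toy transactions: counterparty prefix 'C' = customer, 'M' = merchant.
df = pd.DataFrame({
    'type': ['PAYMENT', 'TRANSFER', 'CASH_OUT', 'PAYMENT', 'DEBIT'],
    'nameDest': ['M123', 'C456', 'C789', 'M321', 'C654'],
})
dest_kind = df['nameDest'].str[0].map({'C': 'customer', 'M': 'merchant'})

# Rows: transaction type; columns: kind of destination account.
print(pd.crosstab(df['type'], dest_kind))
```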

However, the most important difference comes from the columns isFraud and isFlaggedFraud: there is only malicious activity in customer-customer transactions (denoted by '1' in those columns), not in customer-merchant payments... according to the dataset and its author, at least. One might say "that's obvious, since fraud almost exclusively happens between individuals rather than involving merchants"... and I would disagree. Since the data doesn't contain fraudulent behaviour between customers and merchants (only bona fide payments), we could focus on the customer-customer transfers, even though we would lose about 34% of the data. The problem is that including the customer_merchant data does not help our model: how could we evaluate whether the model is capable of detecting fraud among customer-merchant transactions if the data doesn't have the appropriate labels? All labels in that part of the dataset are the same (isFraud == 0 and isFlaggedFraud == 0), so there is no way to benchmark the model against it. But let's do a bit more exploration before making up our minds.

Let's check the general features and distributions of the isFraud data and compare frauds (isFraud == 1) with non-frauds (isFraud == 0):

In [ ]:
fig, axes = plt.subplots(1,2, figsize=(20, 5))

sns.barplot(x='type', y='amount', data=pay_data, ax=axes[0], hue='isFraud').set_title('Mean Amount Transferred (in Millions),\nby Movement Type', fontsize=16, weight='bold');
sns.histplot(x='amount', data=pay_data, ax=axes[1], hue='isFraud', stat='density', element='step', common_norm=False, bins=20, log_scale=True).set_title('Density Distribution for Amount Transferred,\nby Fraud Type', fontsize=16, weight='bold');

Note that the histogram is created with common_norm=False and plots Density rather than the usual Frequency. This is because there are very, very few observations with isFraud == 1, and with the default settings those observations are virtually invisible. This is an important consideration: when training a model we need a good number of observations for each label, otherwise the model will fail to learn to predict each label and differentiate between them. We can confirm numerically just how few such observations there are:

In [ ]:
frauds = pay_data[(pay_data['isFraud'] == 1)]
non_frauds = pay_data[(pay_data['isFraud'] == 0)]

print(
    'Total number of fraud transactions: ', + len(frauds), '\n'
    'Total number of transactions: ', len(pay_data), '\n'
    'Percentage of fraud transactions in the dataset: ', round((len(frauds)/len(pay_data))*100,3), '%'
)
Total number of fraud transactions:  8179 
Total number of transactions:  6353291 
Percentage of fraud transactions in the dataset:  0.129 %

After separating the data, we can check it again visually to make better sense of it (note the difference in the scale used for counts on each histogram).

In [ ]:
fig, axes = plt.subplots(1,2, figsize=(20, 5))

sns.histplot(x='amount', data=non_frauds, ax=axes[0], bins=20, log_scale=True).set_title('Distribution for `amount` in Non-Frauds Data', fontsize=16, weight='bold');
sns.histplot(x='amount', data=frauds, ax=axes[1], bins=20, log_scale=True).set_title('Distribution for `amount` in Frauds Data', fontsize=16, weight='bold');

Also, it seems that the fraudulent transactions are type == TRANSFER and type == CASH_OUT only. Let's see if this is true:

In [ ]:
frauds['type'].unique()
Out[ ]:
array(['TRANSFER', 'CASH_OUT'], dtype=object)

This gives us an important clue about where to focus when training our model. Let's see if we can train and evaluate a model to discriminate between frauds and non-frauds, considering the overlap between the two classes in amount and transfer type.

Step 2 - Training a Basic Logistic Regression Model to Evaluate Performance and Model Assumptions.¶

Since the outcome variable is binary (fraud/non-fraud), let's start with a logistic regression implementation. But which features from the dataset should we use? We can use mutual_info_classif from scikit-learn to estimate which features contribute the most to predicting the outcome variable. However, since this method does not work with categorical variables that have not been encoded (e.g. type), we need to encode them first. We'll evaluate all relevant variables from the original dataframe (i.e. we'll only exclude the names of origin and destination).

In [ ]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

X = pay_data[['type', 'amount', 'oldbalanceOrig', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']]
X_enc = X.copy()
X_enc['type'] = le.fit_transform(X['type'])

y = pay_data['isFraud']
In [ ]:
from sklearn.feature_selection import mutual_info_classif

results = pd.DataFrame({'Feature': X_enc.columns, 'Mutual Info with Target': mutual_info_classif(X_enc,y,random_state=0)})
results.sort_values('Mutual Info with Target', ascending=False)
Out[ ]:
Feature Mutual Info with Target
0 type 0.170757
2 oldbalanceOrig 0.002614
1 amount 0.002437
3 newbalanceOrig 0.000633
4 oldbalanceDest 0.000167
5 newbalanceDest 0.000113

So, as we could see in our initial EDA, the most important difference between frauds and non-frauds comes from the variable type, while amount ranks much lower due to the clear overlap in distributions we saw earlier. This basically confirms some of our previous assumptions. All other variables seem rather independent from the target.

Logistic regression relies on several assumptions that we need to evaluate if we want to be sure the model is correctly specified. Namely,

  • Independence of observations,
  • Outcome variable must be binary,
  • Sufficiently large sample size,
  • No multicollinearity,
  • Linearity of independent variables and log-odds, and
  • No strongly influential outliers.

Let's build a model using the top three features from our earlier analysis (type, amount and oldbalanceOrig) and evaluate it. We'll also evaluate the classifier's performance.

In [ ]:
X = pay_data[['type', 'amount', 'oldbalanceOrig']].copy()  #.copy() avoids a SettingWithCopyWarning
X['type'] = le.fit_transform(X['type'])
y = pay_data['isFraud']
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state=0)
In [ ]:
#   Considering the differences in the scales of the different variables, it is better to transform them 
#   to prevent them from having too much weight on the model.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [ ]:
lr = LogisticRegression()
lr.fit(X_train, y_train);
In [ ]:
print(
    'Train score:\t', lr.score(X_train, y_train),
    '\nTest score:\t', lr.score(X_test, y_test)
)
Train score:	 0.9986942648162268 
Test score:	 0.9986967389091642

Looks too good to be true, but we need to look closer. For a classifier, it is usually better to print a confusion matrix, which effectively shows us the number of true positives, false positives, false negatives, and true negatives from our classification. See the explanation below.

In [ ]:
y_predict = lr.predict(X_test)
In [ ]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_predict)
Out[ ]:
array([[1903501,      42],
       [   2442,       3]])

The Confusion Matrix tells us the number of:

-True Negatives (1903501, top left, or index [0][0]), which for our model is the number of transactions that were not fraud and that were classified as such.

-False Positives (42, top right, or index [0][1]), which is the number of transactions that were not fraud but were classified as fraud.

-False Negatives (2442 bottom left, or index [1][0]), which is the number of transactions that were frauds but were not classified as such.

-True Positives (3 bottom right, or index [1][1]), which is the number of transactions that were actual fraud and that were classified as such.

This basically tells us that the classifier is doing its intended job terribly: it catches almost none of the actual fraud (3 out of 2445 cases), and it also misclassifies some non-fraudulent activity as fraud. What the model does superbly well (which is what inflated our previous score, and why that score is misleading) is classifying non-fraud as non-fraud. But that is not what we need if we want it to detect fraud! The score we saw earlier is plain accuracy: the fraction of all transactions classified correctly, regardless of class. Because the overwhelming majority of transactions are non-fraud, a model can score near-perfect accuracy while detecting virtually no fraud at all.
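To see why the ~99.87% accuracy is meaningless here, note that a degenerate "model" that predicts non-fraud for everything scores almost exactly the same. A quick sketch using the class prevalence of this dataset:

```python
import numpy as np

# Class balance roughly matching this dataset: ~0.13% frauds.
n_total, n_fraud = 6_353_291, 8_179
y_true = np.zeros(n_total, dtype=np.int8)
y_true[:n_fraud] = 1

# "Model" that never predicts fraud.
y_pred = np.zeros(n_total, dtype=np.int8)

accuracy = (y_true == y_pred).mean()
print(accuracy)  # ~0.9987, despite catching zero frauds
```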

This point is driven further home by the usual metrics used to evaluate a classifier, which all come out very close to 0 here, since they all use True Positives in the numerator:

In [ ]:
from sklearn.metrics import recall_score, precision_score, f1_score
print(
    'Recall Score:\t', recall_score(y_test, y_predict), '\n'
    'Precision Score:\t', precision_score(y_test, y_predict), '\n'
    'F1 Score:\t', f1_score(y_test, y_predict)
)
Recall Score:	 0.001226993865030675 
Precision Score:	 0.06666666666666667 
F1 Score:	 0.0024096385542168677

Given that the Precision Score is "intuitively the ability of the classifier not to label as positive a sample that is negative", while the Recall Score is "intuitively the ability of the classifier to find all the positive samples" -- and that the F1 Score balances the two -- we need all of them to be as close to 1 as possible to be satisfied with our classifier.
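We can verify those three numbers by hand from the confusion matrix above (TP = 3, FP = 42, FN = 2442):

```python
# Counts taken from the confusion matrix printed above.
tp, fp, fn = 3, 42, 2442

recall = tp / (tp + fn)       # fraction of actual frauds we caught
precision = tp / (tp + fp)    # fraction of fraud alerts that were real frauds
f1 = 2 * precision * recall / (precision + recall)

print(recall, precision, f1)
# -> 0.001226993865030675 0.06666666666666667 0.0024096385542168677
# matching sklearn's recall_score, precision_score and f1_score output
```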

If we think about it, this is most likely due to the extreme scarcity of frauds in our original dataset (remember that only about 0.13% of transactions were frauds). A machine learning model learns in a way similar to humans: it needs appropriate exposure to both the fraud and non-fraud data to determine what each of them looks like. This is best exemplified using datasets with different proportions of frauds and non-frauds. See how the numbers in the confusion matrix (as well as the recall, precision and F1 scores) change with different proportions in the following example.

In [ ]:
# First we'll determine the sample sizes and the proportion of the combined dataset each sample
# will represent when it is joined with all the frauds data

proportions = [0.1 * i**2 for i in range(1, 10)]  #quadratically increasing multiples of len(frauds): 0.1, 0.4, 0.9, ..., 8.1
lengths = [int(p * len(frauds)) for p in proportions]

for i in lengths:
    print(f'Sample of {i} from non-fraud data equals {round((i*100/(i + len(frauds))),2)}% of the total dataset when the sample and the frauds data are combined')
Sample of 817 from non-fraud data equals 9.08% of the total dataset when the sample and the frauds data are combined
Sample of 3271 from non-fraud data equals 28.57% of the total dataset when the sample and the frauds data are combined
Sample of 7361 from non-fraud data equals 47.37% of the total dataset when the sample and the frauds data are combined
Sample of 13086 from non-fraud data equals 61.54% of the total dataset when the sample and the frauds data are combined
Sample of 20447 from non-fraud data equals 71.43% of the total dataset when the sample and the frauds data are combined
Sample of 29444 from non-fraud data equals 78.26% of the total dataset when the sample and the frauds data are combined
Sample of 40077 from non-fraud data equals 83.05% of the total dataset when the sample and the frauds data are combined
Sample of 52345 from non-fraud data equals 86.49% of the total dataset when the sample and the frauds data are combined
Sample of 66249 from non-fraud data equals 89.01% of the total dataset when the sample and the frauds data are combined
In [ ]:
for i in lengths:
    #Create the dataset combining the sample of different sizes with the frauds data:
    non_fraud_sample = non_frauds.sample(n = i, random_state=100)
    df = pd.concat([non_fraud_sample, frauds])

    #Repeat the feature and label selection, data splitting, training and evaluation of the model:
    pre_X_df = df[['type', 'amount', 'oldbalanceOrig']]
    X_df = pre_X_df.copy()
    X_df['type'] = le.fit_transform(pre_X_df['type'])
    y_df = df['isFraud']
    
    X_train_df, X_test_df, y_train_df, y_test_df = train_test_split(X_df, y_df, train_size = 0.7, test_size = 0.3, random_state=100)

    X_train_df = scaler.fit_transform(X_train_df)
    X_test_df = scaler.transform(X_test_df)

    lr.fit(X_train_df, y_train_df)

    #And we will use the model to make predictions on the real-world data (the original split `X_test` from the dataset)
    #so we can compare straight away how the model performs and what was the best proportion of labels to work with
    y_predict = lr.predict(X_test)

    #Print the confusion matrices and relevant scores 
    print(
        'For sample size of', i, ':', '\n'
        'Confusion matrix:' '\n', confusion_matrix(y_test, y_predict), '\n'
        'Recall Score: ', recall_score(y_test, y_predict), '\n'
        'Precision Score: ', precision_score(y_test, y_predict), '\n'
        'F1 Score: ', f1_score(y_test, y_predict), '\n'
        '\n', '---------------------', '\n'
    )
For sample size of 817 : 
Confusion matrix:
 [[  63658 1839885]
 [      0    2445]] 
Recall Score:  1.0 
Precision Score:  0.0013271238051814822 
F1 Score:  0.0026507297637923324 

 --------------------- 

For sample size of 3271 : 
Confusion matrix:
 [[  91661 1811882]
 [      0    2445]] 
Recall Score:  1.0 
Precision Score:  0.0013476071292550902 
F1 Score:  0.0026915870566036905 

 --------------------- 

For sample size of 7361 : 
Confusion matrix:
 [[ 321874 1581669]
 [    147    2298]] 
Recall Score:  0.939877300613497 
Precision Score:  0.0014507878005034197 
F1 Score:  0.0028971036527711594 

 --------------------- 

For sample size of 13086 : 
Confusion matrix:
 [[1469564  433979]
 [    458    1987]] 
Recall Score:  0.812678936605317 
Precision Score:  0.004557694866113413 
F1 Score:  0.009064553581000476 

 --------------------- 

For sample size of 20447 : 
Confusion matrix:
 [[1721103  182440]
 [    928    1517]] 
Recall Score:  0.6204498977505113 
Precision Score:  0.008246492386807787 
F1 Score:  0.016276649392173905 

 --------------------- 

For sample size of 29444 : 
Confusion matrix:
 [[1797528  106015]
 [   1226    1219]] 
Recall Score:  0.4985685071574642 
Precision Score:  0.011367663241136207 
F1 Score:  0.022228503177454208 

 --------------------- 

For sample size of 40077 : 
Confusion matrix:
 [[1827144   76399]
 [   1444    1001]] 
Recall Score:  0.40940695296523516 
Precision Score:  0.0129328165374677 
F1 Score:  0.025073580061368905 

 --------------------- 

For sample size of 52345 : 
Confusion matrix:
 [[1848547   54996]
 [   1589     856]] 
Recall Score:  0.35010224948875257 
Precision Score:  0.015326219293848026 
F1 Score:  0.02936686278882275 

 --------------------- 

For sample size of 66249 : 
Confusion matrix:
 [[1864295   39248]
 [   1711     734]] 
Recall Score:  0.3002044989775051 
Precision Score:  0.018358261217547897 
F1 Score:  0.034600608103330426 

 --------------------- 

As you can see, the larger the non-fraud sample combined with the frauds data (i.e. the more we dilute the frauds with normal transactions), the more trouble the model has distinguishing frauds from non-frauds, as expected. Let's move forward with a balanced model (50% frauds, 50% non-frauds) for the sake of the exercise.
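The collapse in precision under dilution can also be seen with a back-of-the-envelope calculation, independent of any particular model. Assuming a hypothetical classifier with a fixed recall and false-positive rate, precision necessarily falls as frauds become rarer:

```python
# Hypothetical classifier with fixed recall and false-positive rate;
# only the class balance changes between scenarios.
recall, fpr = 0.9, 0.05

for fraud_rate in [0.5, 0.1, 0.01, 0.001]:
    tp = recall * fraud_rate      # expected true positives per transaction
    fp = fpr * (1 - fraud_rate)   # expected false positives per transaction
    precision = tp / (tp + fp)
    print(f'fraud rate {fraud_rate:>6}: precision {precision:.3f}')
```

With a 50/50 balance this hypothetical classifier would sit around 95% precision; at a 0.1% fraud rate it drops below 2%, which mirrors the pattern in the confusion matrices above.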

In [ ]:
non_fraud_sample = non_frauds.sample(n = len(frauds), random_state=100)
pay_df = pd.concat([non_fraud_sample, frauds])

Let's get back to the assumptions of logistic regression:

  • Independence of observations,
  • Outcome variable must be binary,
  • Sufficiently large sample size,
  • No multicollinearity,
  • Linearity of independent variables and log-odds, and
  • No strongly influential outliers.

We dealt with the first one already when we subsetted the original dataframe to contain only transactions from single customers with no repeats. We know the second one holds as well, since our outcome variable isFraud can only be either 0 or 1. The third one we need to evaluate. How much is "sufficiently large"? Opinions vary, but there are some good rules of thumb: the total sample size should be at least 500 observations, and there should be at least 10 observations of the least frequent outcome for each independent variable. The sample size is already over 500 even for our subset pay_df, and we can see that the second rule of thumb holds as well:
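The second rule of thumb can be checked directly by counting the least frequent outcome and comparing it against 10 observations per independent variable. A minimal sketch, using hypothetical balanced labels in place of pay_df['isFraud']:

```python
import pandas as pd

# Hypothetical balanced labels standing in for pay_df['isFraud']
y = pd.Series([0] * 8000 + [1] * 8000, name='isFraud')
n_features = 3  # type, amount, oldbalanceOrig

min_events = y.value_counts().min()  # observations with the least frequent outcome
needed = 10 * n_features             # rule of thumb: 10 per independent variable
print(min_events, '>=', needed, '->', min_events >= needed)
```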

In [ ]:
for col in pay_df[['type', 'amount', 'oldbalanceOrig']].columns:
    print(col,'\n', pay_df[col].value_counts(), '\n')
type 
 CASH_OUT    7012
TRANSFER    4747
PAYMENT     2711
CASH_IN     1830
DEBIT         58
Name: type, dtype: int64 

amount 
 10000000.00    293
1165187.89       4
429257.45        4
362631.05        3
142791.28        3
              ... 
25714.69         1
226229.51        1
121636.35        1
64428.22         1
150155.42        1
Name: amount, Length: 12146, dtype: int64 

oldbalanceOrig 
 0.00           2748
10000000.00     142
164.00            5
1165187.89        4
429257.45         4
               ... 
369.00            1
93659.00          1
61143.00          1
52267.00          1
86243.00          1
Name: oldbalanceOrig, Length: 9373, dtype: int64 

Let's check for multicollinearity using the variance inflation factor (VIF):

In [ ]:
pre_X_df = pay_df[['type', 'amount', 'oldbalanceOrig']]
X_df = pre_X_df.copy()
X_df['type'] = le.fit_transform(pre_X_df['type'])
y_df = pay_df['isFraud']
    
X_train_df, X_test_df, y_train_df, y_test_df = train_test_split(X_df, y_df, train_size = 0.7, test_size = 0.3, random_state=100)

X_train_df = scaler.fit_transform(X_train_df)
X_test_df = scaler.transform(X_test_df)

lr.fit(X_train_df, y_train_df);
In [ ]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame({'Feature': X_df.columns, 'VIF': [variance_inflation_factor(X_df.values, i) for i in range(len(X_df.columns))]})
vif_data
Out[ ]:
Feature VIF
0 type 1.176049
1 amount 2.060868
2 oldbalanceOrig 1.867042

Since all VIF values are below 5, we can be confident there is little to no multicollinearity, so this assumption is met.

Next we will check for linearity of independent variables and log-odds. This can be performed using what is known as the Box-Tidwell test and then checked visually. The Box-Tidwell test involves transforming the continuous independent variables using the following formula: $$var \cdot \ln(var)$$ So we need to perform these transformations first, then fit a model and evaluate its results.

In [ ]:
pay_df = pay_df[(pay_df['oldbalanceOrig'] > 0)].reset_index()
pre_X_df = pay_df[['type', 'amount', 'oldbalanceOrig']]
X_df = pre_X_df.copy()
X_df['type'] = le.fit_transform(pre_X_df['type'])
y_df = pay_df['isFraud']
In [ ]:
X_df['log_amount'] = X_df['amount'] * np.log(X_df['amount'])
X_df['log_oldbalanceOrig'] = X_df['oldbalanceOrig'] * np.log(X_df['oldbalanceOrig'])
In [ ]:
X_lt = sm.tools.tools.add_constant(X_df, prepend=False)
In [ ]:
logit_results = sm.GLM(y_df, X_lt, family=sm.families.Binomial()).fit()
logit_results.summary()
Out[ ]:
Generalized Linear Model Regression Results
Dep. Variable: isFraud No. Observations: 13610
Model: GLM Df Residuals: 13604
Model Family: Binomial Df Model: 5
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -6285.4
Date: Wed, 26 Oct 2022 Deviance: 12571.
Time: 13:52:38 Pearson chi2: 5.83e+04
No. Iterations: 8 Pseudo R-squ. (CS): 0.3450
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
type 0.4805 0.016 29.744 0.000 0.449 0.512
amount 2.212e-05 7.02e-07 31.515 0.000 2.07e-05 2.35e-05
oldbalanceOrig -9.732e-07 2.12e-07 -4.588 0.000 -1.39e-06 -5.57e-07
log_amount -1.326e-06 4.47e-08 -29.663 0.000 -1.41e-06 -1.24e-06
log_oldbalanceOrig 4.728e-08 1.24e-08 3.803 0.000 2.29e-08 7.16e-08
const -1.8001 0.051 -35.326 0.000 -1.900 -1.700

We are interested in seeing whether the p-values of the test for the variables log_amount and log_oldbalanceOrig are non-significant (i.e. p > 0.05). If they are significant (as is the case for our model), it means that the relationship between the parent variables (amount and oldbalanceOrig) and the log-odds of the outcome variable is not linear. We can see this visually:

In [ ]:
predicted = logit_results.predict(X_lt) #predicted probabilities
log_odds = np.log(predicted / (1 - predicted)) #convert probabilities to log-odds for plotting

fig, axes = plt.subplots(1,2, figsize=(15,5))
sns.scatterplot(x= X_lt['amount'].values, y=log_odds, ax = axes[0], alpha=0.4).set_title('`Amount` vs Log-odds (isFraud)', fontsize=12, weight='bold')
axes[0].set_xlabel('`Amount` (M)')
axes[0].set_ylabel('Log-odds')

sns.scatterplot(x= X_lt['oldbalanceOrig'].values, y=log_odds, ax = axes[1], alpha=0.4).set_title('`oldbalanceOrig` vs Log-odds (isFraud)', fontsize=12, weight='bold')
axes[1].set_xlabel('`oldbalanceOrig`')
axes[1].set_ylabel('Log-odds');

This means that this assumption is not met, which is critical: we might have to introduce transformations on those variables to capture the non-linearity, or use another model altogether. But let's check the final assumption next: no strongly influential outliers. We'll use statsmodels' get_influence method for this.

In [ ]:
influence = logit_results.get_influence(observed=True)
In [ ]:
fig,ax = plt.subplots(figsize=(15,5))
influence.plot_index(ax = ax)
plt.plot(range(len(X_lt)), [4/len(X_lt)]*len(X_lt), c='r', alpha=0.8) #4/len(df) is a standard threshold for influence
fig.tight_layout()
In [ ]:
fig, ax = plt.subplots(figsize=(15,5))
influence.plot_influence(ax = ax)
fig.tight_layout()

So we can see that we have many outliers and many influential points in our subsetted dataframe pay_df. The index plot and influence plot give a good overview of where the outliers (studentised residuals axis) and the highly influential points (leverage axis) lie. However, when there are too many points it makes more sense to look at the tabular results: first Cook's D (cooks_d), then leverage (hat_diag) and then the studentised residuals (standard_resid):
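Beyond eyeballing the top values, the 4/n threshold drawn on the index plot can be applied directly to the summary frame to count how many points are flagged as influential. A sketch using hypothetical Cook's distances in place of the real summary_frame() output:

```python
import numpy as np
import pandas as pd

# Hypothetical Cook's distances standing in for influence.summary_frame()
rng = np.random.default_rng(0)
influence_df = pd.DataFrame({'cooks_d': rng.exponential(scale=0.001, size=1000)})

threshold = 4 / len(influence_df)  # common rule-of-thumb cutoff for influence
flagged = influence_df[influence_df['cooks_d'] > threshold]
print(f'{len(flagged)} of {len(influence_df)} points exceed 4/n = {threshold:.4f}')
```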

In [ ]:
influence_df = influence.summary_frame()
influence_df['cooks_d'].sort_values(ascending=False)[:10]
Out[ ]:
10213    0.048803
4752     0.040594
1749     0.040593
2779     0.040592
558      0.040588
13445    0.033482
1907     0.014154
5107     0.010009
12534    0.009133
1818     0.008870
Name: cooks_d, dtype: float64
In [ ]:
influence_df['hat_diag'].sort_values(ascending=False)[:10]
Out[ ]:
10213    0.184137
13445    0.160522
12534    0.091825
10215    0.084773
13447    0.066821
13567    0.055505
9716     0.050621
12207    0.037901
12536    0.028705
10217    0.025763
Name: hat_diag, dtype: float64
In [ ]:
influence_df['standard_resid'].sort_values()[:10]
Out[ ]:
1907   -73.925830
5107   -70.436013
1818   -68.177359
4441   -55.446559
5071   -47.438348
1749   -42.466870
4752   -42.269426
2779   -42.185536
558    -42.070360
424    -35.670637
Name: standard_resid, dtype: float64

This gives us an idea of where to start trimming if we wanted to go ahead with a Logistic Regression model. However, considering that we did not meet the linearity assumption earlier, it would be wiser to consider other models before coming back to try to transform the data. After all, logistic regression is just one of many, many supervised machine learning algorithms out there.

Step 3 - Considering an alternative model¶

Here is a question we haven't expressly considered yet: are we interested in setting up a model for predictive (categorisation) purposes or for inference purposes? The choice of a model, as well as the validity and usefulness of the chosen model, depends on the answer to this question.

If we wanted to build a model to understand how changes in the predictive variables are connected to changes in the outcome variable (in this case, how changes in amount or payment type are connected to the likelihood of a transaction being classified as Fraud), then we are interested in the inference side of things. If, on the other hand, we just wanted to create a model that will allow us to predict whether a new payment is likely to be fraudulent or not with good accuracy and few errors, then we are just interested in classification.

The assumptions of the models we've reviewed here (and that you can see on many other blogs about checking the validity of the assumptions of models such as linear regression) are more concerned with the inference part of the problem. This is important because if your assumptions do not hold, then you cannot draw sound conclusions from your model: if the model is not valid (i.e. the assumptions do not hold), the estimated relationships between the predictive variables and the outcome do not hold either, and thus we cannot correctly conclude how the independent variables affect the outcome.

Making the distinction here allows us to expand our choices into models that do not care much about assumptions (e.g. Decision Trees/Random Forests, SVMs, etc.) but give us the flexibility of fitting the data better and transforming it on the go. So, for now, and considering we are more interested in predicting (classifying) fraud than in understanding the relationship between the variables and the outcome, we will use a personal favourite of mine to solve this problem: Support Vector Machines, whose kernels can capture non-linear (e.g. polynomial) decision boundaries without us having to transform our data endlessly.
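The kernel flexibility mentioned above is easy to demonstrate on toy data: on concentric circles (a classic non-linearly-separable problem), a linear kernel struggles while an RBF kernel separates the classes without any manual feature transformation. A small sketch using scikit-learn's make_circles:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy non-linearly-separable data: two concentric circles
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf').fit(X, y)

print('linear kernel accuracy:', linear_svm.score(X, y))
print('RBF kernel accuracy:   ', rbf_svm.score(X, y))
```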

So we'll first use default and arbitrary hyperparameters and then we'll tune them. Please note that before implementing any SVM model we need to scale the data, but in our case we already did this when implementing our Logistic Regression model.
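As an aside, a tidy way to keep scaling and fitting together is to wrap them in a scikit-learn Pipeline, which ensures the scaler is fit on training data only (and, later, on the training folds only during cross-validation). A minimal sketch on synthetic stand-in data, not the notebook's actual features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's three features
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The pipeline fits the scaler on the training split only,
# then applies the same transformation to the test split
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, gamma=0.1))
clf.fit(X_tr, y_tr)
print('test accuracy:', clf.score(X_te, y_te))
```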

In [ ]:
from sklearn.svm import SVC

svm = SVC(kernel='rbf', C= 1, gamma=0.1, random_state=0) #C = 1 is default, and gamma is just arbitrary for now
svm.fit(X_train_df, y_train_df)

y_predict_df = svm.predict(X_test_df)

print(
    'Training score:\t', svm.score(X_train_df, y_train_df),
    '\nTest score:\t', svm.score(X_test_df, y_test_df)
)
Training score:	 0.843056768558952 
Test score:	 0.8378158109209454
In [ ]:
print(
    'Confusion matrix:' '\n', confusion_matrix(y_test_df, y_predict_df),
    '\nRecall Score\t:', recall_score(y_test_df, y_predict_df),
    '\nPrecision Score\t:', precision_score(y_test_df, y_predict_df),
    '\nF1 Score:\t', f1_score(y_test_df, y_predict_df),
)
Confusion matrix:
 [[2315  139]
 [ 657 1797]] 
Recall Score	: 0.7322738386308069 
Precision Score	: 0.9282024793388429 
F1 Score:	 0.8186788154897495

We see that even with these default and arbitrary values we get much better results altogether, and we haven't optimised the hyperparameters yet! Hyperparameter tuning is what we'll do next to get the best possible scores. We could manually search for different combinations of hyperparameters, or even set up a loop as follows and then repeat it for other kernels:

largest_score = {'score': 0, 'gamma': 1, 'C': 1}
for gamma in range(1,20):
  for C in range(1,20):
    classifier = SVC(kernel='rbf', C=C, gamma=gamma)
    classifier.fit(X_train_df, y_train_df)
    score = classifier.score(X_test_df, y_test_df)
    if score > largest_score['score']:
      largest_score['score'] = score
      largest_score['gamma'] = gamma
      largest_score['C'] = C
print(largest_score)

But there are more efficient ways to do this. One such way is GridSearchCV, which performs an "exhaustive search over specified parameter values for an estimator": it takes dictionaries and lists as evaluation points for the model and outputs a series of metrics that we can use to evaluate which combination of hyperparameters gives the best results. Importantly, it has the advantage of performing cross-validation on the go! Here is more information about the hyperparameters for the RBF kernel of an SVM.

First, we set up the ranges for the hyperparameter search using numpy's logspace.

In [ ]:
from sklearn.model_selection import GridSearchCV
In [ ]:
C_range = np.logspace(-10,10,5)
gamma_range = np.logspace(-10, 10, 5)
param_grid = dict(gamma=gamma_range, C=C_range)
In [ ]:
cv_search = GridSearchCV(svm, param_grid=param_grid, scoring='f1') #we'll use F1 since it balances precision and recall as seen earlier
cv_search.fit(X_train_df, y_train_df);

Results can then be explored by passing the results to pandas and converting it to a dataframe:

In [ ]:
search_results = pd.DataFrame(cv_search.cv_results_)
search_results.sort_values(by='rank_test_score').head()
Out[ ]:
mean_fit_time std_fit_time mean_score_time std_score_time param_C param_gamma params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
22 30.938518 6.145725 0.049392 0.001331 10000000000.0 1.0 {'C': 10000000000.0, 'gamma': 1.0} 0.972696 0.974071 0.970940 0.975986 0.965517 0.971842 0.003569 1
17 4.847964 1.539862 0.072410 0.001604 100000.0 1.0 {'C': 100000.0, 'gamma': 1.0} 0.968207 0.971698 0.964816 0.968976 0.957071 0.966153 0.005044 2
13 1.670781 0.061041 0.381077 0.002482 1.0 100000.0 {'C': 1.0, 'gamma': 100000.0} 0.951043 0.943861 0.951649 0.940693 0.941446 0.945738 0.004701 3
18 2.389618 0.219896 0.354206 0.001954 100000.0 100000.0 {'C': 100000.0, 'gamma': 100000.0} 0.942844 0.936151 0.943379 0.935232 0.931305 0.937782 0.004649 4
12 0.440423 0.004139 0.188555 0.001574 1.0 1.0 {'C': 1.0, 'gamma': 1.0} 0.887563 0.894785 0.883912 0.904084 0.903811 0.894831 0.008226 5

Or if we just want to know the best parameters we can print them directly:

In [ ]:
print("The best parameters are %s with a score of %0.4f" % (cv_search.best_params_, cv_search.best_score_))
The best parameters are {'C': 10000000000.0, 'gamma': 1.0} with a score of 0.9718

This is a really high score, so we can now use those parameters to test the performance of the classifier on the test data (GridSearchCV already used parts of the training data as validation data multiple times, but we haven't touched the test data we set aside as X_test_df and y_test_df):

In [ ]:
svm_final = SVC(kernel='rbf', C= 10000000000, gamma=1.0, random_state=0)
svm_final.fit(X_train_df, y_train_df);
In [ ]:
y_predict_df = svm_final.predict(X_test_df)

print(
    'Confusion matrix:' '\n', confusion_matrix(y_test_df, y_predict_df),
    '\nRecall Score\t:', recall_score(y_test_df, y_predict_df),
    '\nPrecision Score\t:', precision_score(y_test_df, y_predict_df),
    '\nF1 Score:\t', f1_score(y_test_df, y_predict_df),
)
Confusion matrix:
 [[2337  117]
 [  19 2435]] 
Recall Score	: 0.9922575387123065 
Precision Score	: 0.954153605015674 
F1 Score:	 0.9728326008789453

We could keep improving this classifier, but I think this suffices for this exercise.

Thanks for reading!