Prosper loan dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.
Prosper was founded in 2005 as the first peer-to-peer lending marketplace in the United States. Since then, Prosper has facilitated more than USD 12 billion in loans to more than 770,000 people.
Through Prosper, people can invest in each other in a way that is financially and socially rewarding. Borrowers apply online for a fixed-rate, fixed-term loan between USD 2,000 and USD 40,000. Individuals and institutions can invest in the loans and earn attractive returns. Prosper handles all loan servicing on behalf of the matched borrowers and investors. (source: https://www.prosper.com/about)
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
from pywaffle import Waffle
import plotly.express as px
pd.set_option('display.max_colwidth', None) # to display the whole strings and not collapse them
pd.set_option('display.max_columns', None) # to not collapse columns when viewing data
pd.set_option('display.max_rows', None) # to display all rows in the output
import warnings
warnings.filterwarnings("ignore") # warnings off
df_full = pd.read_csv('prosperLoanData.csv')
df_full.head()
df_full.info()
Using 81 variables for the analysis is not realistic, we will select around 15 of the most interesting ones.
We will use the data dictionary to assist with the selection of relevant variables:
Variable | Description |
---|---|
ListingKey | Unique key for each listing, same value as the 'key' used in the listing object in the API. |
ListingNumber | The number that uniquely identifies the listing to the public as displayed on the website. |
ListingCreationDate | The date the listing was created. |
CreditGrade | The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings. |
Term | The length of the loan expressed in months. |
LoanStatus | The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket. |
ClosedDate | Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses. |
BorrowerAPR | The Borrower's Annual Percentage Rate (APR) for the loan. |
BorrowerRate | The Borrower's interest rate for this loan. |
LenderYield | The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee. |
EstimatedEffectiveYield | Effective yield is equal to the borrower interest rate (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. Applicable for loans originated after July 2009. |
EstimatedLoss | Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009. |
EstimatedReturn | The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009. |
ProsperRating (numeric) | The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009. |
ProsperRating (Alpha) | The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009. |
ProsperScore | A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009. |
ListingCategory | The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans |
BorrowerState | The two letter abbreviation of the state of the address of the borrower at the time the Listing was created. |
Occupation | The Occupation selected by the Borrower at the time they created the listing. |
EmploymentStatus | The employment status of the borrower at the time they posted the listing. |
EmploymentStatusDuration | The length in months of the employment status at the time the listing was created. |
IsBorrowerHomeowner | A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner. |
CurrentlyInGroup | Specifies whether or not the Borrower was in a group at the time the listing was created. |
GroupKey | The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation. |
DateCreditPulled | The date the credit profile was pulled. |
CreditScoreRangeLower | The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency. |
CreditScoreRangeUpper | The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency. |
FirstRecordedCreditLine | The date the first credit line was opened. |
CurrentCreditLines | Number of current credit lines at the time the credit profile was pulled. |
OpenCreditLines | Number of open credit lines at the time the credit profile was pulled. |
TotalCreditLinespast7years | Number of credit lines in the past seven years at the time the credit profile was pulled. |
OpenRevolvingAccounts | Number of open revolving accounts at the time the credit profile was pulled. |
OpenRevolvingMonthlyPayment | Monthly payment on revolving accounts at the time the credit profile was pulled. |
InquiriesLast6Months | Number of inquiries in the past six months at the time the credit profile was pulled. |
TotalInquiries | Total number of inquiries at the time the credit profile was pulled. |
CurrentDelinquencies | Number of accounts delinquent at the time the credit profile was pulled. |
AmountDelinquent | Dollars delinquent at the time the credit profile was pulled. |
DelinquenciesLast7Years | Number of delinquencies in the past 7 years at the time the credit profile was pulled. |
PublicRecordsLast10Years | Number of public records in the past 10 years at the time the credit profile was pulled. |
PublicRecordsLast12Months | Number of public records in the past 12 months at the time the credit profile was pulled. |
RevolvingCreditBalance | Dollars of revolving credit at the time the credit profile was pulled. |
BankcardUtilization | The percentage of available revolving credit that is utilized at the time the credit profile was pulled. |
AvailableBankcardCredit | The total available credit via bank card at the time the credit profile was pulled. |
TotalTrades | Number of trade lines ever opened at the time the credit profile was pulled. |
TradesNeverDelinquent | Number of trades that have never been delinquent at the time the credit profile was pulled. |
TradesOpenedLast6Months | Number of trades opened in the last 6 months at the time the credit profile was pulled. |
DebtToIncomeRatio | The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%). |
IncomeRange | The income range of the borrower at the time the listing was created. |
IncomeVerifiable | The borrower indicated they have the required documentation to support their income. |
StatedMonthlyIncome | The monthly income the borrower stated at the time the listing was created. |
LoanKey | Unique key for each loan. This is the same key that is used in the API. |
TotalProsperLoans | Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans. |
TotalProsperPaymentsBilled | Number of on time payments the borrower made on Prosper loans at the time they created this listing. This value will be null if the borrower had no prior loans. |
OnTimeProsperPayments | Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans. |
ProsperPaymentsLessThanOneMonthLate | Number of payments the borrower made on Prosper loans that were less than one month late at the time they created this listing. This value will be null if the borrower had no prior loans. |
ProsperPaymentsOneMonthPlusLate | Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans. |
ProsperPrincipalBorrowed | Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans. |
ProsperPrincipalOutstanding | Principal outstanding on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans. |
ScorexChangeAtTimeOfListing | Borrower's credit score change at the time the credit profile was pulled. This will be the change relative to the borrower's last Prosper loan. This value will be null if the borrower had no prior loans. |
LoanCurrentDaysDelinquent | The number of days delinquent. |
LoanFirstDefaultedCycleNumber | The cycle the loan was charged off. If the loan has not charged off the value will be null. |
LoanMonthsSinceOrigination | Number of months since the loan originated. |
LoanNumber | Unique numeric value associated with the loan. |
LoanOriginalAmount | The origination amount of the loan. |
LoanOriginationDate | The date the loan was originated. |
LoanOriginationQuarter | The quarter in which the loan was originated. |
MemberKey | The unique key that is associated with the borrower. This is the same identifier that is used in the API member object. |
MonthlyLoanPayment | The scheduled monthly loan payment. |
LP_CustomerPayments | Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries. |
LP_CustomerPrincipalPayments | Pre charge-off cumulative principal payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries. |
LP_InterestandFees | Pre charge-off cumulative interest and fees paid by the borrower. If the loan has charged off, this value will exclude any recoveries. |
LP_ServiceFees | Cumulative service fees paid by the investors who have invested in the loan. |
LP_CollectionFees | Cumulative collection fees paid by the investors who have invested in the loan. |
LP_GrossPrincipalLoss | The gross charged off amount of the loan. |
LP_NetPrincipalLoss | The principal that remains uncollected after any recoveries. |
LP_NonPrincipalRecoverypayments | The interest and fee component of any recovery payments. The current payment policy applies payments in the following order: Fees, interest, principal. |
PercentFunded | Percent the listing was funded. |
Recommendations | Number of recommendations the borrower had at the time the listing was created. |
InvestmentFromFriendsCount | Number of friends that made an investment in the loan. |
InvestmentFromFriendsAmount | Dollar amount of investments that were made by friends. |
Investors | The number of investors that funded the loan. |
# there seem to be multiple loan identifiers
df_full.ListingKey.nunique()
df_full.ListingNumber.nunique()
df_full.LoanKey.nunique()
We can choose any of the IDs to be the index of our dataset since they all seem to represent the loan ID.
However, we have a slightly lower number of unique keys than data points.
Checking the above unique identifiers suggests there are some duplicates in the data, let's clean these first:
df_full[df_full.LoanKey.duplicated() == True].count()['LoanKey']
df_full[df_full.LoanKey.duplicated() == True]['LoanKey'].head(5)
# different in ProsperScore, where the value also should not be > 10 according to the dictionary and here we have 11
df_full[df_full.LoanKey == '02163700809231365A56A1C']
# the same issue as above
df_full[df_full.LoanKey == 'CB643706582969191E8C26B']
df_full.duplicated().sum()
# drop the problematic ProsperScore column
df_full.drop('ProsperScore', axis=1, inplace=True)
df_full.duplicated().sum()
# drop duplicate rows
df_full.drop_duplicates(inplace=True)
df_full[df_full.LoanKey.duplicated() == True].count()['LoanKey']
df_full.LoanKey.nunique()
# all loans are now included only once
df_full.shape
The main loan and borrower indicators were selected for further analysis based on the data dictionary.
The full selected list can be seen below:
# this variable subset will be a starting point for further cleaning and investigation of relevant data
df = df_full[['LoanKey', 'LoanOriginationDate', 'ClosedDate', 'LoanStatus', 'Term', 'LoanOriginalAmount', 'MonthlyLoanPayment',\
'ListingCategory (numeric)', 'BorrowerAPR', 'BorrowerRate', \
'CreditGrade', 'ProsperRating (Alpha)', 'CreditScoreRangeLower', 'CreditScoreRangeUpper',\
'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'DebtToIncomeRatio', 'EmploymentStatus',\
'IsBorrowerHomeowner']]
df.head()
# data type changes follow
df.info()
# rename columns with spaces
df.rename(columns={'ListingCategory (numeric)': 'ListingCategory', 'ProsperRating (Alpha)': 'ProsperRating'}, inplace=True)
# LoanKey as index
df.set_index('LoanKey', inplace=True)
# date columns as dates
df.LoanOriginationDate = pd.to_datetime(df.LoanOriginationDate)
df.ClosedDate = pd.to_datetime(df.ClosedDate)
df.IncomeRange.value_counts()
# reduce no. of IncomeRange categories by 1
df.IncomeRange.replace({'$0': 'Not employed'}, inplace=True)
# CreditGrade, ProsperRating, IncomeRange as ordered categorical vars
var_dict = {'CreditGrade': ['AA', 'A', 'B', 'C', 'D', 'E', 'HR', 'NC'],
'ProsperRating': ['AA', 'A', 'B', 'C', 'D', 'E', 'HR', 'NC'],
'IncomeRange': ['Not displayed', 'Not employed', '$1-24,999',
'$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+']}
for var in var_dict:
ordered_var = pd.api.types.CategoricalDtype(ordered=True, categories = var_dict[var])
df[var] = df[var].astype(ordered_var)
df.describe()
We are dealing with a loan database of over 113k loans with 81 variables. To be able to do a meaningful analysis, we had to select a subset of these variables.
The interest rate of the loan, its amount, and the credit rating of the borrower are the main features of interest.
The supporting features of interest are the loan characteristics on the one hand, such as the loan amount, the date when the loan was taken, the term of the loan, the status of the loan, and some borrower characteristics on the other hand, such as the income, the employment and housing situation and debt to income ration of the borrower.
df.head(3)
df.shape
# all histograms in one place for a quick overview
# some vars are very skewed and will require transformations
df.hist(['Term', 'LoanOriginalAmount', 'MonthlyLoanPayment', 'ListingCategory', 'BorrowerAPR', 'BorrowerRate',\
'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'StatedMonthlyIncome', 'DebtToIncomeRatio'],\
figsize=(10,10));
We will take look at all selected variables one by one to learn more details about their distributions.
df.LoanOriginationDate.describe()
df[df.LoanOriginationDate < '2009-01-01 00:00:00'].count()['LoanOriginationDate']
Dataset includes loans from period 2005-2014.
28,940 loans are from the period 2005-2008.
# 2009 was a critical year for Prosper
df.groupby(df.LoanOriginationDate.dt.year).count()['LoanOriginationDate'].plot(kind='bar')
plt.ylabel('Number of loans')
plt.xticks(rotation=0)
plt.tight_layout();
df['LoanYear'] = df['LoanOriginationDate'].dt.year
df.LoanYear.dtype
# let's look at the tails: 2005 and 2014 to know why there are fewer loans on those years
df[df.LoanOriginationDate.dt.year == 2005].count()['LoanOriginationDate']
# our data ends in March 2014
df.LoanOriginationDate.max()
Prosper started business in 2005, which is also the oldest loans which we have in the data set. However, there are only 22 loans from 2005.
Prosper stopped their business between Nov 2008 - July 2009 due to a cease order by SEC, which is very much reflected in the low number of loans in 2009.
Borrowers may come from 47 US states, bidders can come only from 28 US states since the relaunch. (source: https://en.wikipedia.org/wiki/Prosper_Marketplace)
Since 2010, the number of new loans increases steadily, with a record number of new loans in 2013.
We only have two months of full data for 2014.
# 55076 loans are already closed
df.ClosedDate.notnull().sum()
df.groupby(df.ClosedDate.dt.year).count()['ClosedDate'].plot(kind='bar')
plt.ylabel('Number of loans')
plt.xticks(rotation=0)
plt.tight_layout();
# months between loan origination and closing date
df['MonthsToClosed'] = ((df.ClosedDate - df.LoanOriginationDate) / np.timedelta64(1, 'M'))
df['MonthsToClosed'].describe()
df[df['MonthsToClosed'] < 0]
There are 2 loans with closed date prior to loan origination date, which we will delete:
df.drop(df[df['MonthsToClosed'] < 0].index, inplace=True)
df.shape
bins = np.arange(0, df['MonthsToClosed'].max()+2, 2)
plt.hist(data = df, x = 'MonthsToClosed', bins = bins)
plt.title('Distribution of number of months to reach closed status');
df[['LoanOriginationDate', 'ClosedDate', 'MonthsToClosed', 'Term', 'LoanStatus']].head()
There are 55k of closed loans. Generated column MonthsToClosed looks at number of months between loan origination date and closed date.
The distribution of MonthsToClosed is bimodal with one peak around 12 months and another peak around 36 months.
df.LoanStatus.value_counts()
It makes sense to divide loan status further into active and non-active loans because these are two distinct groups.
ClosedDate can used for this.
df.loc[df['ClosedDate'].isnull(), 'LoanStatusActive'] = 1
df.loc[df['ClosedDate'].notnull(), 'LoanStatusActive'] = 0
df.LoanStatusActive.value_counts()
df.LoanStatusActive = df['LoanStatusActive'].astype('bool')
df.LoanStatusActive.dtype
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
# set base color for plots
base_color = sb.color_palette()[0]
sb.countplot(data = df[df.LoanStatusActive == 1], y = 'LoanStatus', color = base_color, orient = 'h')
plt.xticks(rotation = 0)
plt.title('Active loans')
plt.subplot(1,2,2)
order = ['FinalPaymentInProgress', 'Past Due (1-15 days)', 'Past Due (16-30 days)', \
'Past Due (31-60 days)', 'Past Due (61-90 days)', 'Past Due (91-120 days)', 'Past Due (>120 days)']
sb.countplot(data = df[df.LoanStatus.isin(order)], y = 'LoanStatus', order = order, color = base_color, orient = 'h')
plt.xticks(rotation = 0)
plt.title('Active loans (current status excluded)')
plt.tight_layout();
Among active loans, the vast majority are loans without any issues, i.e. in the "current" status.
Among active loans not in the current status, the highest amount is in default for 1-15 days.
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sb.countplot(data = df[df.LoanStatusActive == 0], y = 'LoanStatus', color = base_color)
plt.xticks(rotation = 0)
plt.title('Non-Active loans')
plt.subplot(1,2,2)
sb.countplot(data = df[df.LoanStatus.isin(['Defaulted', 'Chargedoff', 'Cancelled'])], y = 'LoanStatus', \
color = base_color)
plt.xticks(rotation = 0)
plt.title('Non-Active loans (completed status excluded)')
plt.tight_layout();
Most non-active loans are in "completed" status, i.e. fully repaid.
In the not-repaid group of non-active loans, there are more charged off loans (i.e. those were the chance of repayment is minimal) than defaulted ones.
df[df.LoanStatus == 'Chargedoff'].count()['LoanStatus'] / len(df)
Charged off loans make 10.6% of all loans.
# make status an ordered categorical var
ordered_var = pd.api.types.CategoricalDtype(ordered=True, categories = ['Current', 'FinalPaymentInProgress',
'Past Due (1-15 days)', 'Past Due (16-30 days)',
'Past Due (31-60 days)', 'Past Due (61-90 days)',
'Past Due (91-120 days)', 'Past Due (>120 days)',
'Completed', 'Defaulted', 'Chargedoff', 'Cancelled'])
df['LoanStatus'] = df['LoanStatus'].astype(ordered_var)
sb.countplot(data = df, x = 'Term', palette = 'pastel');
data = {'12': df[df['Term'] == 12].count()['Term'],
'36': df[df['Term'] == 36].count()['Term'], '60': df[df['Term'] == 60].count()['Term']}
fig = plt.figure(
FigureClass=Waffle,
rows=9,
columns=9,
values=data,
figsize=(4, 4),
legend={'loc': 'upper left', 'bbox_to_anchor': (1.1, 1)},
colors=("lightblue", "peachpuff", "lightgreen")
)
plt.title('Term distribution')
plt.show()
There are only 3 distinct loan terms:
The vast majority of loans is for 36 months.
plt.figure(figsize=(13,4))
plt.suptitle('Loan original amount distribution')
plt.subplot(1,2,1)
bins = np.arange(0, df['LoanOriginalAmount'].max()+900, 900)
plt.hist(data = df, x = 'LoanOriginalAmount', bins = bins)
plt.subplot(1,2,2)
sb.distplot(df['LoanOriginalAmount']);
np.sort(df.LoanOriginalAmount.value_counts().nlargest(8).index)
df.LoanOriginalAmount.value_counts().nlargest(8)
Loan amounts are right skewed with several peaks as shown in the histogram.
The most frequently loaned amount is 4000.
bins = np.arange(0, df['MonthlyLoanPayment'].max()+100, 100)
plt.hist(data = df, x = 'MonthlyLoanPayment', bins = bins)
plt.xlabel('Monthly loan payment')
plt.ylabel('Number of loans');
df.MonthlyLoanPayment.mode()
df.MonthlyLoanPayment.value_counts().nlargest(5)
df.MonthlyLoanPayment.describe()
Monthly payment distribution is right skewed and has a long tail.
Let's look at its log transformed distribution:
log_binsize = 0.02
bins = 10 ** np.arange(1.3, np.log10(df['MonthlyLoanPayment'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = df, x = 'MonthlyLoanPayment', bins = bins)
plt.xscale('log')
plt.xlabel('Monthly loan payment')
plt.ylabel('Number of loans')
plt.xticks([90, 173, 400, 1e3, 2e3], [90, 173, 400, '1k', '2k'])
plt.show()
Now we can see the distribution more clearly. Many loans have very low loan payments and the mode of 173 is clearly visible.
# keep ListingCategory as int to easily keep the same order in all plots and have legible axis labels
sb.countplot(data = df, x = 'ListingCategory', color = base_color);
The dictionary for listing categories is the following:
By far the most loans are debt consolidation and many loans do not have a category defined.
The distribution shows that ListingCategory might not be a very interesting variable for further analysis after all and if, then only in the top 5 categories. Below top 5 categories, there is a very small loan sample.
plt.figure(figsize=(13,4))
plt.suptitle("Borrower's Annual Percentage Rate distribution")
plt.subplot(1,2,1)
bins = np.arange(0, df.BorrowerAPR.max()+0.02, 0.02)
plt.hist(data = df, x = 'BorrowerAPR', bins = bins)
plt.subplot(1,2,2)
sb.distplot(df['BorrowerAPR']);
plt.figure(figsize=(13,4))
plt.suptitle("Borrower's interest rate distribution")
plt.subplot(1,2,1)
bins = np.arange(0, df.BorrowerRate.max()+0.02, 0.02)
plt.hist(data = df, x = 'BorrowerRate', bins = bins)
plt.subplot(1,2,2)
sb.distplot(df['BorrowerRate']);
plt.figure(figsize=(15,4))
plt.subplot(1,2,1)
g = sb.distplot(df['BorrowerAPR'])
plt.subplot(1,2,2)
sb.distplot(df['BorrowerRate'])
plt.ylim(g.get_ylim());
df[['BorrowerAPR', 'BorrowerRate']].describe()
BorrowerAPR and BorrowerRate have very similar distributions.
APR rate is somewhat larger because it includes fees.
# the most common rates are very high, as is also seen by the peak in the histogram
df['BorrowerAPR'].value_counts().nlargest(5)
It is somewhat suprising that 25% of loans have APR larger than 28%, which seems very high.
However, looking at the very extreme end of APR 0.4, there are 68 fully repaid loans and just 34 charged-off or defaulted in this category:
df[(df['BorrowerAPR']>= 0.4) & (df['LoanStatus'] == 'Completed')].count()['Term']
df[(df['BorrowerAPR']>= 0.4) & ((df['LoanStatus'] == 'Defaulted') | (df['LoanStatus'] == 'Chargedoff'))].count()['Term']
BorrowerAPR and BorrowerRate contain the same information, let's keep just one of them and drop the other one:
df.drop('BorrowerRate', axis = 1, inplace = True)
These two columns belong together because they are mutually separated by age of loans. We can merge these two columns to get a rating for all loans.
To prove this point and look at the final rating distribution, we have to do some bivariate plotting already at this point.
Further, we will create another additional column which will group the ratings into risk categories.
plt.figure(figsize=(13,4))
plt.suptitle("Credit rating variables distribution")
plt.subplot(1,2,1)
sb.countplot(data = df, x = 'CreditGrade', palette = 'RdYlBu_r')
plt.subplot(1,2,2)
sb.countplot(data = df, x = 'ProsperRating', palette = 'RdYlBu_r');
There are fewer loans with CreditGrade because it was only used until 2009 and since then, ProsperRating is used:
plt.figure(figsize=(13,4))
plt.suptitle("Credit ratings amounts by year")
plt.subplot(1,2,1)
df.groupby(df.LoanOriginationDate.dt.year).count()['CreditGrade'].plot(kind='bar')
plt.title('CreditGrade')
plt.xticks(rotation = 0)
plt.subplot(1,2,2)
df.groupby(df.LoanOriginationDate.dt.year).count()['ProsperRating'].plot(kind='bar')
plt.title('ProsperRating')
plt.xticks(rotation = 0)
plt.tight_layout();
# create new variable CreditRating which combines the two rating columns
df['CreditRating'] = np.where(df.CreditGrade.isnull(), df.ProsperRating, df.CreditGrade)
# test for correctness
df[['CreditGrade', 'ProsperRating', 'CreditRating']].sample(15)
# make the new column CreditRating an ordinal categorical variable
df.CreditRating = df['CreditRating'].astype(pd.api.types.CategoricalDtype(ordered=True, categories = var_dict['CreditGrade']))
sb.countplot(data = df, x = 'CreditRating', palette = 'RdYlBu_r')
plt.xlabel('Credit Rating')
plt.ylabel('Number of loans');
# 131 loans are missing a rating
df.CreditRating.isnull().sum()
# because they did not have a rating in either CreditGrade or ProsperRating columns
df[df.CreditRating.isnull()][['LoanOriginationDate', 'CreditGrade', 'ProsperRating', 'CreditRating']].sample(5)
# these loans are from 2009-2010
df[df.CreditRating.isnull()][['LoanOriginationDate', 'CreditGrade', 'ProsperRating']].describe()
# the other two ratings columns will not be needed anymore
df.drop(['CreditGrade', 'ProsperRating'], axis=1, inplace=True)
It will also be useful to create another rating variable, which will group scores into a few riskiness categories.
# create new variable CreditRating which combines the two rating columns
df['CreditRisk'] = np.where(df.CreditRating.isin(['AA', 'A']), 'low',
np.where(df.CreditRating.isin(['B', 'C', 'D']), 'medium',
np.where(df.CreditRating.isin(['E', 'HR', 'NC']), 'high', None)))
# test
df[['CreditRating', 'CreditRisk']].sample(5)
# test on NaN
df[df.CreditRating.isnull()][['CreditRating', 'CreditRisk']].sample(5)
df.CreditRisk.isnull().sum()
# CreditRisk as ordered categorical var
ordered_var = pd.api.types.CategoricalDtype(ordered=True, categories = ['low', 'medium', 'high'])
df['CreditRisk'] = df['CreditRisk'].astype(ordered_var)
sb.countplot(data = df, x = 'CreditRisk', palette = 'RdYlBu_r')
plt.xlabel('Credit Risk')
plt.ylabel('Number of loans');
There are similar amounts of low and high risk loans with the majority if all loans being of medium risk.
We will look at these two variables together to compare the gap between them.
plt.figure(figsize=(13,4))
plt.suptitle("Credit Score Range")
plt.subplot(1,2,1)
bins_lower = np.arange(0, df.CreditScoreRangeLower.max()+20, 20)
plt.hist(data = df, x = 'CreditScoreRangeLower', bins = bins_lower)
plt.xlabel('CreditScoreRangeLower')
plt.subplot(1,2,2)
bins_upper = np.arange(0, df.CreditScoreRangeUpper.max()+20, 20)
plt.hist(data = df, x = 'CreditScoreRangeUpper', bins = bins_upper)
plt.xlabel('CreditScoreRangeUpper');
df[['CreditScoreRangeLower', 'CreditScoreRangeUpper']].describe()
The two distributions are very similar and this is because the range buckets are not large (they are of size 20) and here are all the existing buckets listed:
df.groupby(['CreditScoreRangeLower', 'CreditScoreRangeUpper'])['LoanOriginationDate'].count()
All information will be kept even when we drop one of the columns and work just with the other one:
df.drop('CreditScoreRangeUpper', axis = 1, inplace = True)
sb.countplot(data = df, y = 'IncomeRange', color = base_color)
plt.xticks(rotation = 0);
We can see two large groups in the loans: mid earners on the one hand (25k - 75k) making the majority of loans, and high earners group (75k+) on the other.
df.IncomeVerifiable.value_counts().plot(kind='pie', autopct='%1.0f%%');
Majority of loans have borrowers with a verifiable income.
# monthly income is very right skewed
bins = np.arange(0, 30500, 500)
plt.hist(data = df, x = 'StatedMonthlyIncome', bins = bins)
plt.xlabel('Stated monthly income')
plt.xlim((0,30000));
log_binsize = 0.07
bins = 10 ** np.arange(2.5, np.log10(df['StatedMonthlyIncome'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = df, x = 'StatedMonthlyIncome', bins = bins)
plt.xscale('log')
plt.xticks([1000, 3000, 5000, 10000, 20000, 40000], ['1k', '3k', '5k', '10k', '20k', '40k'])
plt.xlim((0,80000))
plt.xlabel('Stated monthly income')
plt.show()
Given the very skewed distribution of stated monthly income, we transformed the scale. The distribution looks normal after the transformation.
df.StatedMonthlyIncome.describe()
Some borrowers have very high stated monthly income of over 90k.
These outliers might be errors, but we cannot be sure, so we don't delete them.
df[df.StatedMonthlyIncome > 90000][['IncomeVerifiable', 'CreditRisk', 'LoanStatus', 'EmploymentStatus', 'IncomeRange']].sample(5)
df.StatedMonthlyIncome.value_counts().nlargest(5)
plt.figure(figsize=(13,4))
plt.suptitle('Debt to income ratio distribution')
plt.subplot(1,2,1)
bins = np.arange(0, 10.5, 0.5)
plt.hist(data = df, x = 'DebtToIncomeRatio', bins = bins)
plt.xlabel('Debt to Income ratio')
plt.subplot(1,2,2)
bins = np.arange(0, 2.05, 0.05)
plt.hist(data = df, x = 'DebtToIncomeRatio', bins = bins)
plt.xlim((0,2))
plt.xlabel('Debt to Income ratio');
df.DebtToIncomeRatio.describe()
Debt to income ratio is very right skewed with bulk of values around 0.2. Let's transform the scale to look at the distribution in detail:
log_binsize = 0.05
bins = 10 ** np.arange(-1.5, np.log10(df['DebtToIncomeRatio'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = df, x = 'DebtToIncomeRatio', bins = bins)
plt.xscale('log')
plt.xticks([0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2], [0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2])
plt.xlim((0,5))
plt.xlabel('Debt to Income ratio')
plt.ylabel('Number of loans')
plt.show()
plt.figure(figsize=[9, 5])
status_counts = df['EmploymentStatus'].value_counts()
status_order = status_counts.index
max_count = status_counts[0]
max_prop = max_count / df.shape[0]
tick_props = np.arange(0, max_prop, 0.1)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
g = sb.countplot(data = df, x = 'EmploymentStatus', color = base_color, order = status_order)
plt.xticks(rotation = 0)
plt.xlabel('Employment Status')
plt.ylabel('Number of loans')
# values must be ordered from the largest for this to work
for i in range(status_counts.shape[0]):
count = status_counts[i]
pct_string = '{:0.1f}%'.format(100 * count / df.shape[0])
plt.text(i, count+10, pct_string, ha = 'center')
plt.tight_layout();
Majority of borrowers are employed.
df.IsBorrowerHomeowner.value_counts().plot(kind='pie', autopct='%1.0f%%');
Exactly half of loans belongs to a borrower who is a home owner.
numeric_vars = ['BorrowerAPR', 'LoanOriginalAmount', 'MonthlyLoanPayment', 'CreditScoreRangeLower',
'StatedMonthlyIncome', 'DebtToIncomeRatio', 'MonthsToClosed', 'Term']
bool_vars = ['LoanStatusActive', 'IncomeVerifiable', 'IsBorrowerHomeowner']
ordered_categoric_vars = ['IncomeRange', 'CreditRisk', 'CreditRating']
# pairplot of numeric vars to see relationships overview on a sample of data for faster rendering
sample = np.random.choice(df.shape[0], 500, replace = False)
df_samp = df.iloc[sample,:]
sb.pairplot(df_samp, vars = numeric_vars);
# correlations
plt.figure(figsize = [10, 7])
sb.heatmap(df[['BorrowerAPR', 'LoanOriginalAmount', 'MonthlyLoanPayment', 'CreditScoreRangeLower', 'StatedMonthlyIncome', \
'DebtToIncomeRatio', 'MonthsToClosed', 'Term', 'LoanStatusActive', 'IncomeVerifiable', 'IsBorrowerHomeowner']]\
.corr(), annot = True, fmt = '.2f',
cmap = 'vlag_r', center = 0)
plt.show()
The dependent variables of our interest are BorrowerAPR, the borrower annual percentage rate, and LoanOriginalAmount, i.e. the borrowed amount.
Based on the correlations, we will look into the following relationships of dependent variables:
Further, we see that borrowers with a verifiable income tend to have a lower debt to income ratio and that active loans are on average for a longer period, higher amounts and higher monthly payments than non-active loans.
Relationships of interest:
# a sample of 5000 from the data
sample = np.random.choice(df.shape[0], 5000, replace = False)
df_samp = df.iloc[sample,:]
plt.figure(figsize = (20,10))
plt.subplot(2,2,1)
g = plt.scatter(data = df_samp, x = 'LoanOriginalAmount', y = 'BorrowerAPR', alpha = 1/10)
plt.xscale('log')
plt.xlabel('Loan Original Amount (log scale)')
plt.ylabel('Borrower APR')
plt.ylim((0,0.45))
plt.xticks([1000, 2000, 5000, 10000, 20000], [1000, 2000, 5000, 10000, 20000])
# display the correlation
ax = plt.gca()
coef = df[['LoanOriginalAmount', 'BorrowerAPR']].corr().iloc[1,0]
label = r'$\rho$ = ' + str(round(coef, 2))
ax.annotate(label, xy = (0.75, 0.9), size = 20, xycoords = ax.transAxes)
plt.subplot(2,2,2)
plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'BorrowerAPR', alpha = 1/10)
plt.xscale('log')
plt.xlabel('Monthly Loan Payment (log scale)')
plt.ylabel('Borrower APR')
plt.xticks([10, 30, 100, 300, 800], [10, 30, 100, 300, 800])
plt.xlim((0,2000))
plt.ylim((0,0.45))
# display the correlation
ax = plt.gca()
coef = df[['MonthlyLoanPayment', 'BorrowerAPR']].corr().iloc[1,0]
label = r'$\rho$ = ' + str(round(coef, 2))
ax.annotate(label, xy = (0.75, 0.9), size = 20, xycoords = ax.transAxes)
plt.subplot(2,2,3)
plt.scatter(data = df_samp, x = 'CreditScoreRangeLower', y = 'BorrowerAPR', alpha = 1/10)
plt.xlabel('Credit Score Range Lower')
plt.ylabel('Borrower APR')
plt.xlim((400, 900))
plt.ylim((0,0.45))
# display the correlation
ax = plt.gca()
coef = df[['CreditScoreRangeLower', 'BorrowerAPR']].corr().iloc[1,0]
label = r'$\rho$ = ' + str(round(coef, 2))
ax.annotate(label, xy = (0.75, 0.9), size = 20, xycoords = ax.transAxes)
plt.suptitle('Borrower APR strongest relationships', size = 25);
The negative correlations of BorrowerAPR and loan original amount, monthly loan payment and credit score range lower are somewhat visible from the scatter plots and they all make logical sense.
plt.figure(figsize = (15,5))
plt.subplot(1,2,1)
plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'LoanOriginalAmount', alpha = 1/10)
plt.ylabel('Loan Original Amount')
plt.xlabel('Monthly Loan Payment')
plt.xlim((0,1500))
# display the correlation
ax = plt.gca()
coef = df[['MonthlyLoanPayment', 'LoanOriginalAmount']].corr().iloc[1,0]
label = r'$\rho$ = ' + str(round(coef, 2))
ax.annotate(label, xy = (0.1, 0.85), size = 20, xycoords = ax.transAxes)
plt.subplot(1,2,2)
plt.scatter(data = df_samp, x = 'CreditScoreRangeLower', y = 'LoanOriginalAmount', alpha = 1/10)
plt.xlabel('Credit Score Range Lower')
plt.ylabel('Loan Original Amount')
plt.xlim((400,900))
# display the correlation
ax = plt.gca()
coef = df[['CreditScoreRangeLower', 'LoanOriginalAmount']].corr().iloc[1,0]
label = r'$\rho$ = ' + str(round(coef, 2))
ax.annotate(label, xy = (0.1, 0.85), size = 20, xycoords = ax.transAxes)
plt.suptitle('Loan Original Amount strongest relationships', size = 20);
The positive correlation of loan original amount with monthly loan payment and credit score range lower is clear from the plots and makes logical sense, too.
Displaying monthly payment on a logarithmic scale shows the quadratic relationship with loan original amount.
plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'LoanOriginalAmount', alpha = 1/10)
plt.ylabel('Loan original amount')
plt.xlabel('Monthly loan payment')
plt.xlim((20,2000))
plt.xscale('log')
plt.xticks([90, 173, 400, 1e3, 2e3], [90, 173, 400, '1k', '2k']);
Next, let's look at relationships of categorical variables with the dependent variables.
# sample of 2000
samples = np.random.choice(df.shape[0], 1000, replace = False)
df_samp = df.iloc[samples,:]
def boxgrid(x, y, **kwargs):
""" Function for creating box plots with seaborn's PairGrid. """
default_color = sb.color_palette()[0]
sb.boxplot(x, y, color = default_color)
plt.figure(figsize = [15, 10])
g = sb.PairGrid(data = df_samp, x_vars = ['BorrowerAPR', 'LoanOriginalAmount'], y_vars = ordered_categoric_vars,
size = 3, aspect = 1.5)
g.map(boxgrid)
plt.show();
plt.figure(figsize = (15,5))
plt.suptitle('Interest rate differentiation by credit risk')
plt.subplot(1,2,1)
# subplots do not work when using catplot+violin kind
sb.violinplot(data=df_samp, x="CreditRisk", y="BorrowerAPR", palette = 'RdYlBu_r')
sb.despine()
plt.subplot(1,2,2)
bins = np.arange(0.05, 0.45,0.01)
plt.hist(df_samp[df_samp['CreditRisk']=='low']['BorrowerAPR'], alpha = 0.5, color = sb.color_palette('RdYlBu_r')[0], bins = bins)
plt.hist(df_samp[df_samp['CreditRisk']=='medium']['BorrowerAPR'], alpha = 0.5, color = sb.color_palette('RdYlBu_r')[3], bins = bins)
plt.hist(df_samp[df_samp['CreditRisk']=='high']['BorrowerAPR'], alpha = 0.5, color = sb.color_palette('RdYlBu_r')[5], bins = bins)
sb.despine()
plt.xlabel('Borrower APR')
plt.ylabel('Number of loans')
plt.show();
plt.figure(figsize = [10, 20])
g = sb.PairGrid(data = df_samp, x_vars = ['BorrowerAPR', 'LoanOriginalAmount'], y_vars = ['EmploymentStatus', 'LoanStatus'],
size = 4, aspect = 1.5)
g.map(boxgrid)
plt.show();
plt.figure(figsize = [10, 10])
g = sb.PairGrid(data = df_samp, y_vars = ['BorrowerAPR', 'LoanOriginalAmount'], x_vars = bool_vars,
size = 3, aspect = 1.5)
g.map(boxgrid)
plt.show();
g = sb.FacetGrid(data = df, col = 'IncomeVerifiable')
g.map(sb.distplot, 'BorrowerAPR');
Borrowers with a verifiable income benefit from lower rate.
g = sb.FacetGrid(data = df, col = 'IsBorrowerHomeowner')
g.map(sb.distplot, 'LoanOriginalAmount');
Borrowers who own home get higher loan amounts.
plt.figure(figsize = [10, 10])
g = sb.PairGrid(data = df_samp, y_vars = ['BorrowerAPR', 'LoanOriginalAmount'], x_vars = numeric_vars[-1:],
size = 3, aspect = 1.5)
g.map(boxgrid)
plt.show()
plt.figure(figsize = (15,5))
plt.subplot(1,2,1)
g1 = sb.violinplot(data=df_samp, x="Term", y="LoanOriginalAmount", inner='stick', palette = 'pastel')
plt.subplot(1,2,2)
g2 = sb.barplot(data = df, x = 'Term', y = 'LoanOriginalAmount', palette = 'pastel')
plt.ylabel('Avg loan original amount')
plt.show();
sb.catplot(x="Term", y="BorrowerAPR", kind="violin", inner='stick', data=df_samp, palette = 'pastel');
sb.countplot(data = df, x = 'IsBorrowerHomeowner', hue = 'CreditRisk', palette = 'RdYlBu_r');
There are more high risk borrowers among those who do not own a house.
# most loans are for 36m and the most common rating is C
sb.countplot(data = df, x = 'CreditRating', hue = 'Term', palette = 'pastel');
The above box plots reveal the following (none of which is actually too surprising and it is what we would expect from a reasonable lending strategy):
Let's very briefly also look at other numeric vars than BorrowerAPR and LoanOriginalAmount:
plt.figure(figsize = [10, 10])
g = sb.PairGrid(data = df_samp, y_vars = ['DebtToIncomeRatio', 'MonthlyLoanPayment'], x_vars = ['CreditRisk', 'Term'],
size = 3, aspect = 1.5)
g.map(boxgrid)
plt.show()
The squeezed debt to income ration box plots tell us that there are many outliers in the data with very high ratios.
High risk borrowers have lower monthly payments compared to other risk categories.
plt.figure(figsize = [10, 15])
g = sb.PairGrid(data = df_samp, y_vars = ['BorrowerAPR', 'LoanOriginalAmount'], x_vars = 'ListingCategory',
size = 5, aspect = 1.5)
g.map(boxgrid)
plt.xticks(rotation = 0)
plt.show();
Many of the listing category groups are small, hence the long whiskers.
plt.figure(figsize = [10,12])
plt.subplot(3,1,1)
sb.countplot(data = df, x = 'EmploymentStatus', hue = 'CreditRisk', palette = 'RdYlBu_r')
plt.subplot(3,1,2)
sb.countplot(data = df, x = 'EmploymentStatus', hue = 'Term', palette = 'pastel')
plt.subplot(3,1,3)
sb.countplot(data = df, x = 'EmploymentStatus', hue = 'IsBorrowerHomeowner', palette = 'colorblind')
plt.tight_layout();
plt.figure(figsize = [10,12])
plt.subplot(3,1,1)
sb.countplot(data = df, x = 'IncomeRange', hue = 'CreditRisk', palette = 'RdYlBu_r')
plt.subplot(3,1,2)
sb.countplot(data = df, x = 'IncomeRange', hue = 'Term', palette = 'pastel')
plt.subplot(3,1,3)
sb.countplot(data = df, x = 'IncomeRange', hue = 'IsBorrowerHomeowner', palette = 'colorblind')
plt.tight_layout();
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR')
plt.title('Borrower APR development in time')
plt.ylabel('Borrower APR')
plt.xlabel('Loan origination year')
plt.ylim((0,0.3));
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount')
plt.title('Loan original amount development in time')
plt.ylabel('Average loan original amount')
plt.xlabel('Loan origination year');
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'IncomeVerifiable')
plt.ylim((0,1.2));
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'IsBorrowerHomeowner');
sb.countplot(data = df, x = df['LoanOriginationDate'].dt.year, hue = 'IsBorrowerHomeowner', palette = 'colorblind');
sb.countplot(data = df, x = df['LoanOriginationDate'].dt.year, hue = 'Term', palette = 'pastel')
plt.ylabel('Number of loans')
plt.legend(title = 'Term', loc = 'upper left');
plt.figure(figsize = [10,8])
sb.countplot(data = df, x = df['LoanOriginationDate'].dt.year, hue = 'CreditRating', palette = 'RdYlBu_r');
plt.figure(figsize = [10,8])
sb.countplot(data = df, x = df['LoanOriginationDate'].dt.year, hue = 'CreditRisk', palette = 'RdYlBu_r')
plt.xlabel('Loan origination year')
plt.ylabel('Number of loans')
plt.title('Loans distribution by credit risk');
plt.figure(figsize = [10,12])
sb.countplot(data = df, x = df['LoanOriginationDate'].dt.year, hue = 'EmploymentStatus')
plt.ylim((0,6000));
Borrower APR is negatively correlated with the loan original amount, monthly loan payment and credit score range lower, all of which makes sense. If you can borrow larger amounts, you are likely a more solvent borrower, therefore you get a lower interest rate and you have a higher credit score and can afford and be allowed to pay higher monthly payments.
This interconnectedness of the main features is also shown by the positive correlation between loan original amount with monthly loan payment and with credit score.
The credit rating clearly separates borrowers into different interest rate groups, which is exactly what we would expect from a reasonable lending business. Many of the variables included in the full dataset are inputs for calculating the credit rating, some of which we are still investigating in this analysis. The purpose of all this data about borrowers should lead to an appropriate rating based on which he can get an appropriate interest rate and other loan details (amount, monthly payment, term). High risk borrowers for example get loans for much lower amounts than medium or low risk borrowers.
This section will build on previous findings and combine multiple variables to see how they interact.
A few sections will follow, each focusing on one variable and its relationship with others.
# Pair plot colored by credit risk
sb.pairplot(df_samp,
vars = ['BorrowerAPR', 'LoanOriginalAmount', 'MonthlyLoanPayment', 'CreditScoreRangeLower'],
hue = 'CreditRisk', diag_kind = 'kde', palette = 'RdYlBu_r',
plot_kws = {'alpha': 0.6, 's': 90},
size = 4);
Credit risk groups are clearly separated in our featured variables.
g = sb.FacetGrid(data = df_samp, col = 'CreditRisk', hue = 'Term',
size = 3, aspect = 1.5)
g.map(sb.scatterplot, 'LoanOriginalAmount', 'BorrowerAPR')
g.add_legend()
plt.show();
The rising interest rate among risk groups is visible, as well as the 60m loans having a rather higher interest rate than 36m loans on average in all risk groups.
g = sb.FacetGrid(data = df_samp, col = 'CreditRisk', hue = 'Term',
size = 3, aspect = 1.5)
g.map(sb.scatterplot, 'LoanOriginalAmount', 'MonthlyLoanPayment')
g.add_legend()
plt.show()
It is clear that for the same loan amount, the monthly payments differ mostly by the term of the loan.
It shows also that we have most borrowers in the medium risk category and that high risk borrowers cannot borrow very high loan amounts.
g = sb.FacetGrid(data = df_samp, col = 'CreditRisk', row = 'IsBorrowerHomeowner', hue = 'Term',
size = 3, aspect = 1.5)
g.map(sb.scatterplot, 'MonthlyLoanPayment', 'BorrowerAPR', alpha = 0.7)
g.add_legend()
plt.show();
Owning a home does not seem to make a huge difference in interest rate and monthly payments when separating by both credit risk and loan term.
However, we can see that in low risk group, there are more homeowners that non-home owners, whereas the opposite is true for the high risk group.
# it is interesting how the 60m loans appear in the center of distributions
g = sb.FacetGrid(data = df_samp, col = 'CreditRisk', hue = 'Term',
size = 3, aspect = 1.5)
g.map(sb.scatterplot, 'DebtToIncomeRatio', 'BorrowerAPR', alpha = 0.7)
plt.xlim((0,2))
g.add_legend()
plt.show();
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'CreditRisk', palette = 'RdYlBu_r');
df_rate_dev = df.groupby(['LoanYear', 'CreditRisk'])['BorrowerAPR'].mean()
df_rate_dev = pd.DataFrame(df_rate_dev) # created with multiindex
df_rate_dev.reset_index(inplace=True)
fig = px.bar(df_rate_dev,
y="CreditRisk",
x="BorrowerAPR",
animation_frame="LoanYear",
animation_group="BorrowerAPR",
orientation='h',
color="CreditRisk",
range_x=[0, df_rate_dev.BorrowerAPR.max()+0.05],
color_continuous_scale=px.colors.diverging.RdYlBu)
# improve aesthetics (size, grids etc.)
fig.update_layout(width=700,
height=500,
xaxis_showgrid=False,
yaxis_showgrid=False,
title_text='Evolution of Borrower APR by credit risk',
showlegend=False)
fig.update_xaxes(title_text='Average Borrower APR')
fig.update_yaxes(title_text='Credit Risk')
fig.show()
The interest rate differentiation for the risk group is very clear and quite large.
The average interest rate is declining for high and medium risk groups in the last couple of years, but this is not the case for low risk group.
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount', hue = 'CreditRisk', palette = 'RdYlBu_r');
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'IsBorrowerHomeowner', hue = 'CreditRisk', palette = 'RdYlBu_r');
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'IncomeVerifiable', hue = 'CreditRisk', palette = 'RdYlBu_r');
There is an interesting shift where in the beginning of business, the riskiest borrowers apparently had to have a verified income more often than the safer borrower groups, but this is no longer the case since 2009 and the high risk group also has the lowest share of verifiable income.
There are very clear differences among the credit risk groups, which is a good sign. Let's follow with looking if this still holds as clearly in the more detailed credit rating variable.
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'CreditRating', palette = 'RdYlBu_r')
plt.legend(loc = 'right', bbox_to_anchor=(1.25, 0.5), ncol=1);
plt.figure(figsize=(15,5))
sb.barplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount', hue = 'CreditRating', palette = 'RdYlBu_r')
plt.legend(loc = 'right', bbox_to_anchor=(1.1, 0.5), ncol=1);
sb.catplot(x = 'CreditRating', y = 'BorrowerAPR', hue = 'Term', kind = 'bar', data = df_samp, size = 6, palette = 'pastel');
sb.catplot(x = 'CreditRating', y = 'LoanOriginalAmount', hue = 'Term', kind = 'bar', data = df_samp, size = 6,\
palette = 'pastel');
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'IsBorrowerHomeowner', \
palette = 'colorblind');
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount', hue = 'IsBorrowerHomeowner', \
palette = 'colorblind');
g = sb.FacetGrid(data = df_samp, hue = 'IsBorrowerHomeowner', size = 6, palette = 'colorblind')
g.map(plt.scatter, 'LoanOriginalAmount', 'BorrowerAPR', alpha = 0.6)
g.add_legend()
plt.show()
sb.catplot(x="IsBorrowerHomeowner", y="BorrowerAPR", hue="CreditRisk",
kind="violin", inner=None,
palette="RdYlBu_r", data=df_samp);
# the same is also visible here
sb.catplot(x="CreditRisk", y="BorrowerAPR", hue="IsBorrowerHomeowner",
kind="violin", inner=None, split=True,
palette="colorblind", data=df_samp);
sb.catplot(x="CreditRisk", y="LoanOriginalAmount", hue="IsBorrowerHomeowner",
kind="violin", inner=None, split=True,
palette="colorblind", data=df_samp);
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'IncomeRange', palette = 'RdYlBu',\
linestyles = '--')
plt.legend(loc = 'right', bbox_to_anchor=(1.45, 0.5), ncol=1);
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount', hue = 'IncomeRange', \
palette = 'RdYlBu', linestyles = '--')
plt.legend(loc = 'right', bbox_to_anchor=(1.45, 0.5), ncol=1);
sb.catplot(y = 'IncomeRange', x = 'BorrowerAPR', hue = 'CreditRisk', kind = 'box', data = df_samp, size = 6, palette = 'RdYlBu_r');
sb.catplot(y = 'IncomeRange', x = 'LoanOriginalAmount', hue = 'CreditRisk', kind = 'box', data = df_samp, size = 6, palette = 'RdYlBu_r');
sb.catplot(y = 'IncomeRange', x = 'BorrowerAPR', hue = 'Term', kind = 'box', data = df_samp, size = 6, palette = 'pastel');
sb.catplot(y = 'IncomeRange', x = 'BorrowerAPR', hue = 'IsBorrowerHomeowner', kind = 'box', data = df_samp, size = 6, \
palette = 'colorblind');
plt.scatter(data = df_samp, x = 'LoanOriginalAmount', y = 'BorrowerAPR', alpha = 1/2, c = 'MonthlyLoanPayment', \
cmap = 'viridis_r')
plt.colorbar(label = 'Monthly Loan Payment')
plt.ylabel('Borrower APR')
plt.xlabel('Loan Original Amount')
plt.show()
plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'BorrowerAPR', alpha = 1/2, c = 'LoanOriginalAmount', \
cmap = 'viridis_r')
plt.colorbar(label = 'Loan Original Amount')
plt.ylabel('Borrower APR')
plt.xlabel('Monthly Loan Payment')
plt.xlim((20,2000))
plt.xscale('log')
plt.xticks([90, 173, 400, 1e3, 2e3], [90, 173, 400, '1k', '2k'])
plt.show()
plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'LoanOriginalAmount', alpha = 1/2, c = 'BorrowerAPR', \
cmap = 'viridis_r')
plt.colorbar(label = 'BorrowerAPR')
plt.ylabel('Loan original amount')
plt.xlabel('Monthly Loan Payment')
plt.xlim((20,2000))
plt.xscale('log')
plt.xticks([90, 173, 400, 1e3, 2e3], [90, 173, 400, '1k', '2k'])
plt.show()
# borrowers with verifiable income borrow higher amounts more often
g = sb.FacetGrid(data = df_samp, col = 'IncomeVerifiable',
size = 3, aspect = 1.5)
g.map(sb.scatterplot, 'LoanOriginalAmount', 'BorrowerAPR')
plt.show()
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'EmploymentStatus',\
markers = ['.','s','*','D','x','o','P','d'], color = 'blue', errwidth=0, linestyles='')
plt.legend(loc = 'right', bbox_to_anchor=(1.45, 0.5), ncol=1);
sb.catplot(y = 'EmploymentStatus', x = 'BorrowerAPR', hue = 'CreditRisk', order = \
['Full-time','Employed','Part-time','Self-employed','Retired','Other','Not employed','Not available'], \
kind = 'box', data = df_samp, size = 6, palette = 'RdYlBu_r');
sb.catplot(y = 'EmploymentStatus', x = 'LoanOriginalAmount', hue = 'CreditRisk', order = \
['Full-time','Employed','Part-time','Self-employed','Retired','Other','Not employed','Not available'], \
kind = 'box', data = df_samp, size = 6, palette = 'RdYlBu_r')
plt.xlim((0,30000));
sb.catplot(y = 'EmploymentStatus', x = 'BorrowerAPR', hue = 'IsBorrowerHomeowner', order = \
['Full-time','Employed','Part-time','Self-employed','Retired','Other','Not employed','Not available'], \
kind = 'box', data = df_samp, size = 6, palette = 'colorblind');
Differences among employment statuses are not very significant or consistent. This combined with knowing that employment categories are very imbalanced as was described earlier prevents from making any solid conclusions.
Credit risk category, as expected and hoped, is a major factor which segments the borrowers into clearly distinguishable groups.
Being assigned into one of the risk categories strongly determines the interest rate and the borrowed amount.
Borrowed amounts are increasing overall in the last few years, while the average interest rate is decreasing (especially for the medium and high risk groups).
House ownership or income range are also great for borrower segmentation, but only until combined with credit risk. This is because house ownership, income range and many other available variables feed into the credit score/credit rating/credit risk calculations and once a borrower is assigned his credit rating/credit risk, these variables are already accounted for and make no difference anymore.