Segmentation of Prosper Borrowers¶

Investigation Overview¶

This analysis examines two variables, credit risk and income range, that segment borrowers into groups which can expect different interest rates and loan amounts.

Dataset Overview¶

The loans dataset contains 113k+ Prosper loans from the period 2005-2014 and 81 variables. Prosper is a US peer-to-peer lending company founded in 2005, so the dataset covers the first years of its business. The data collection changed in 2009, which we kept in mind during the investigation.

Credit rating and credit risk¶

Credit rating was engineered by merging two columns: CreditGrade, which was used before 2009, and ProsperRating, which has been in use since 2009. This allows us to work with the whole dataset.
Credit risk was further engineered from the credit rating by dividing ratings into 3 categories:

AA, A ratings are considered low risk,
B, C, D ratings are considered medium risk,
E, HR, NC ratings are considered high risk.
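
The merge and the risk grouping described above could be sketched roughly as follows. This is a minimal illustration on made-up rows, not the notebook's actual preprocessing: the column names are taken from the text (the real post-2009 column may be named differently in the raw file), and `rating_order` and `risk_map` are helper names introduced here.

```python
import pandas as pd

# Made-up rows standing in for the real loans table.
# Column names follow the text above and are an assumption about the raw file.
df = pd.DataFrame({
    "CreditGrade":   ["AA", "C", None, None, "NC", None],
    "ProsperRating": [None, None, "B", "HR", None, "A"],
})

# Take the pre-2009 grade where present, otherwise the post-2009 rating.
rating_order = ["AA", "A", "B", "C", "D", "E", "HR", "NC"]
df["CreditRating"] = pd.Categorical(
    df["CreditGrade"].fillna(df["ProsperRating"]),
    categories=rating_order,
    ordered=True,
)

# Map each rating onto the three risk groups listed above.
risk_map = {
    "AA": "low", "A": "low",
    "B": "medium", "C": "medium", "D": "medium",
    "E": "high", "HR": "high", "NC": "high",
}
df["CreditRisk"] = pd.Categorical(
    df["CreditRating"].astype(object).map(risk_map),
    categories=["low", "medium", "high"],
    ordered=True,
)

print(df[["CreditRating", "CreditRisk"]])
```

Storing both columns as ordered categoricals keeps the ratings and risk groups in a meaningful order on plot axes and in legends.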

Loans are distributed across all categories, with the highest concentration in the medium risk group (B, C, and D ratings).
This holds in every year, but it is especially visible in 2013, which was a record year in terms of new loans. In contrast, the crisis year 2009 stands out for its low number of new loans: Prosper did not issue loans for about half of 2009 because of an SEC cease-and-desist order.

In [3]:

plt.figure(figsize=(16,6))
plt.suptitle("Credit Rating and Credit Risk distributions", size = 18)
plt.subplot(1,2,1)
sb.countplot(data = df, x = 'CreditRating', palette = 'RdYlBu_r')
plt.xlabel('Credit Rating')
plt.ylabel('Number of loans')
plt.ylim(0,25000)
sb.despine()
plt.subplot(1,2,2)
sb.countplot(data = df, x = 'CreditRisk', palette = 'RdYlBu_r')
plt.xlabel('Credit Risk')
plt.ylabel('Number of loans')
sb.despine();

In [4]:

sb.set_style("whitegrid")
plt.figure(figsize = [9,6])
sb.countplot(data = df, x = df['LoanOriginationDate'].dt.year, hue = 'CreditRisk', palette = 'RdYlBu_r')
plt.xlabel('Loan origination year')
plt.ylabel('Number of loans')
plt.title('Credit Risk distribution in time')
sb.despine(left=True)
# 'left' is not a valid legend location; anchor the legend's upper-left corner instead
plt.legend(loc = 'upper left', bbox_to_anchor=(0.2, 0.88), ncol=1, title = 'Credit Risk')
plt.tight_layout();

Interest rate¶

We use the borrower annual percentage rate (APR) as the interest rate measure. Its values range from roughly 0.05 to 0.4.
An APR of 0.4 seems very high, but 68 loans with an APR of 0.4 or higher were completed (i.e. fully repaid) and only 34 were defaulted or charged off, so such rates appear to work for a certain group of borrowers.
Furthermore, the most frequent rates are also at the high end, between 0.35 and 0.37.
The average interest rate rose steeply between 2007 and 2011 and has been decreasing since then.
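
A count like the one quoted for loans with an APR of 0.4 or higher could be obtained along these lines. The rows below are made up, and the `LoanStatus` column with the labels `Completed`, `Defaulted` and `Chargedoff` is an assumption about how loan outcomes are encoded, so the printed numbers are not the figures from the dataset.

```python
import pandas as pd

# Made-up rows; "LoanStatus" and its labels are assumed names.
loans = pd.DataFrame({
    "BorrowerAPR": [0.41, 0.40, 0.12, 0.42, 0.36, 0.40],
    "LoanStatus":  ["Completed", "Defaulted", "Completed",
                    "Chargedoff", "Current", "Completed"],
})

# Keep only the very expensive loans, then count outcomes.
high_apr = loans[loans["BorrowerAPR"] >= 0.4]
repaid = (high_apr["LoanStatus"] == "Completed").sum()
failed = high_apr["LoanStatus"].isin(["Defaulted", "Chargedoff"]).sum()

print(f"repaid: {repaid}, defaulted or charged off: {failed}")
```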

In [5]:

sb.set_style("white")
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
bins = np.arange(0, df.BorrowerAPR.max()+0.01, 0.01)
plt.hist(data = df, x = 'BorrowerAPR', bins = bins)
plt.title('Borrower APR distribution')
plt.xlabel('Borrower APR')
plt.ylabel('Number of loans')
sb.despine()

plt.subplot(1,2,2)
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR')
plt.title('Borrower APR development in time')
plt.ylabel('Average Borrower APR')
plt.xlabel('Loan origination year')
plt.ylim((0,0.3))
sb.despine()
plt.tight_layout();

Loan original amount¶

Distribution of loan amount is multimodal, with the most popular loan amounts being 4k, 15k and 10k.
The average loan amount has increased every year since 2010.
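
Both statements can be checked without a plot. The sketch below uses a handful of made-up loans, so the values only illustrate the mechanics: `value_counts` ranks the most common amounts, and a `groupby` on the origination year gives the yearly average.

```python
import pandas as pd

# Made-up loans; only the two columns used here are included.
loans = pd.DataFrame({
    "LoanOriginationDate": pd.to_datetime([
        "2010-03-01", "2010-07-15", "2011-02-10",
        "2011-09-30", "2012-05-05", "2012-11-20",
    ]),
    "LoanOriginalAmount": [4000, 2000, 4000, 10000, 15000, 4000],
})

# Most frequent loan amounts (the peaks of the multimodal distribution).
top_amounts = loans["LoanOriginalAmount"].value_counts().head(3)

# Average loan amount per origination year.
year = loans["LoanOriginationDate"].dt.year
yearly_mean = loans.groupby(year)["LoanOriginalAmount"].mean()

print(top_amounts)
print(yearly_mean)
```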

In [6]:

sb.set_style("white")
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
bins = np.arange(0, df['LoanOriginalAmount'].max()+900, 900)
plt.hist(data = df, x = 'LoanOriginalAmount', bins = bins)
plt.title('Loan original amount distribution')
plt.xlabel('Loan original amount ($)')
plt.ylabel('Number of loans')
sb.despine()

plt.subplot(1,2,2)
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount')
plt.title('Loan original amount development in time')
plt.ylabel('Average loan original amount ($)')
plt.xlabel('Loan origination year')
plt.ylim((0,14000))
sb.despine()
plt.tight_layout();

Monthly loan payment¶

The most frequent monthly payments are between USD 172 and 175.

Monthly payment has a strong relationship with the loan amount. It makes sense that when someone borrows more, they also generally need to pay higher monthly payments, which can be mitigated only by borrowing for a longer period of time.
For this reason, we will focus only on the loan amount in further analysis.
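
The link between loan amount, term and monthly payment can be made concrete with the standard annuity formula for a fully amortizing loan. This is a sketch under assumptions: we do not know exactly how Prosper computes payments, the rate is treated as a nominal annual rate (the dataset's APR also reflects fees), and `monthly_payment` is a helper introduced here.

```python
def monthly_payment(principal: float, annual_rate: float, months: int) -> float:
    """Fixed monthly payment of a fully amortizing loan (annuity formula)."""
    r = annual_rate / 12  # monthly rate
    if r == 0:
        return principal / months  # no interest: equal slices of the principal
    return principal * r / (1 - (1 + r) ** -months)


# Same amount and rate, different terms: a longer term lowers the payment.
for months in (12, 36, 60):
    print(months, round(monthly_payment(4000, 0.20, months), 2))
```

At a fixed rate and term the payment is proportional to the principal, which is consistent with the strong relationship seen in the scatter plot.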

In [7]:

sb.set_style("ticks")
# a sample of 5000 from the data
sample = np.random.choice(df.shape[0], 5000, replace = False)
df_samp = df.iloc[sample,:]

plt.figure(figsize=[18, 6])
plt.suptitle('Monthly loan payment characteristics', size = 18)
plt.subplot(1,2,1)
log_binsize = 0.02
bins = 10 ** np.arange(1.3, np.log10(df['MonthlyLoanPayment'].max())+log_binsize, log_binsize)
plt.hist(data = df, x = 'MonthlyLoanPayment', bins = bins)
plt.xscale('log')
plt.xlabel('Monthly loan payment ($)')
plt.ylabel('Number of loans')
plt.xticks([90, 173, 400, 1e3, 2e3], [90, 173, 400, '1k', '2k'])
sb.despine()

plt.subplot(1,2,2)
plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'LoanOriginalAmount', alpha = 1/10)
plt.ylabel('Loan original amount ($)')
plt.xlabel('Monthly loan payment ($)')
plt.xlim((20,2000))
plt.xscale('log')
plt.xticks([90, 173, 400, 1e3, 2e3], [90, 173, 400, '1k', '2k'])
sb.despine()

plt.show();

In [8]:

plt.figure(figsize=[10, 6])
plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'BorrowerAPR', alpha = 0.7, c = 'LoanOriginalAmount',
            cmap = 'viridis_r')
plt.colorbar(label = 'Loan Original Amount ($)')
plt.ylabel('Borrower APR')
plt.xlabel('Monthly Loan Payment ($)')
plt.title('Interaction of monthly payment & loan amount & interest rate')
plt.xlim((20,2000))
plt.xscale('log')
plt.xticks([50, 173, 400, 1e3, 2e3], [50, 173, 400, '1k', '2k'])
sb.despine()
plt.show()

Segments of borrowers by credit risk¶

Interest rate¶

First we look into how credit risk/rating separates borrowers into distinct groups in terms of the interest rate which they can expect on their loans and how this develops in time.

Different risk categories have clearly different interest rates, which is what we would expect. Many borrower details go into calculating the credit rating/risk, which is then used to set an appropriate rate and loan amount, and the data are consistent with this.

The interest rate has been decreasing in recent years, as we already know, but this is driven mostly by the higher risk ratings.
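
One way to see which risk group drives a change in the average rate is to tabulate the mean APR by year and risk group and compare two years. The numbers below are made up purely to show the mechanics, and the `year` column is a hypothetical helper (in the notebook it would come from `LoanOriginationDate`).

```python
import pandas as pd

# Made-up average rates; not taken from the Prosper data.
loans = pd.DataFrame({
    "year":        [2011, 2011, 2011, 2013, 2013, 2013],
    "CreditRisk":  ["low", "medium", "high", "low", "medium", "high"],
    "BorrowerAPR": [0.10, 0.22, 0.36, 0.10, 0.19, 0.30],
})

# Rows are years, columns are risk groups, cells are mean APR.
mean_apr = loans.pivot_table(
    index="year", columns="CreditRisk", values="BorrowerAPR", aggfunc="mean"
)

# Change in mean APR per risk group between the two years.
change = mean_apr.loc[2013] - mean_apr.loc[2011]
print(change.sort_values())
```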

In [9]:

sb.set_style("white")
plt.figure(figsize = (18,6))
plt.suptitle('Interest rate differentiation by credit risk', size = 18)

plt.subplot(1,2,1)
sb.violinplot(data=df_samp, x="CreditRisk", y="BorrowerAPR", palette = 'RdYlBu_r')
sb.despine()

plt.subplot(1,2,2)
bins = np.arange(0.05, 0.45,0.01)
plt.hist(df_samp[df_samp['CreditRisk']=='low']['BorrowerAPR'], alpha = 0.5, color = sb.color_palette('RdYlBu_r')[0], bins = bins, label = 'low')
plt.hist(df_samp[df_samp['CreditRisk']=='medium']['BorrowerAPR'], alpha = 0.5, color = sb.color_palette('RdYlBu_r')[3], bins = bins, label = 'medium')
plt.hist(df_samp[df_samp['CreditRisk']=='high']['BorrowerAPR'], alpha = 0.5, color = sb.color_palette('RdYlBu_r')[5], bins = bins, label = 'high')
sb.despine()
plt.xlabel('Borrower APR')
plt.ylabel('Number of loans')
plt.legend(title='Credit Risk')

plt.show();

In [10]:

plt.figure(figsize = (17,7))
plt.suptitle('Interest rate differentiation by credit risk/rating in time', size = 18)

plt.subplot(1,2,1)
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'CreditRisk', palette = 'RdYlBu_r');
plt.xlabel('Loan origination year')
plt.ylabel('Average Borrower APR')
sb.despine()

plt.subplot(1,2,2)
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'CreditRating', palette = 'RdYlBu_r')
plt.legend(loc = 'right', bbox_to_anchor=(1.25, 0.5), ncol=1, title = 'Credit Rating', frameon = False)
plt.xlabel('Loan origination year')
plt.ylabel('Average Borrower APR')
sb.despine();

Segments of borrowers by credit risk¶

Loan original amount¶

Next, we look into how a borrower's credit risk determines the amount that they can borrow.
Higher risk groups borrow lower amounts, and this holds true in all years.
The average loan amount increases steadily across all groups.

In [11]:

sb.set_style("white")
plt.figure(figsize = (18,6))
plt.suptitle('Loan amount differentiation by credit risk', size = 18)

plt.subplot(1,2,1)
sb.violinplot(data=df_samp, x="CreditRisk", y="LoanOriginalAmount", palette = 'RdYlBu_r')
plt.ylabel('Loan original amount ($)')
sb.despine()

plt.subplot(1,2,2)
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount', hue = 'CreditRisk', palette = 'RdYlBu_r');
plt.xlabel('Loan origination year')
plt.ylabel('Average loan original amount ($)')
sb.despine()

plt.show();

Segments of borrowers by income range¶

Finally, we look at how a borrower's income range determines the amount that they can borrow and the interest rate.
Higher income groups borrow higher amounts at lower interest rates, and this holds true in all years.

When we look at income range and credit risk combined, the income range effect disappears. This is because income range is one of the determinants of credit risk, so its effect is already included once the credit risk group is assigned. In other words, a high risk borrower earning USD 100k+ is still a high risk borrower (and such borrowers do exist, as the visual shows).
The income range segmentation exists due to different proportions of credit risk groups in individual income ranges.
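
The differing proportions of credit risk groups within each income range can be tabulated with a row-normalized cross-tabulation. The rows and income range labels below are made up for illustration, so the shares are not the dataset's actual proportions.

```python
import pandas as pd

# Made-up borrowers; the labels are illustrative only.
loans = pd.DataFrame({
    "IncomeRange": ["$25,000-49,999"] * 4 + ["$100,000+"] * 4,
    "CreditRisk":  ["high", "high", "medium", "low",
                    "high", "medium", "low", "low"],
})

# Share of each risk group within every income range (each row sums to 1).
shares = pd.crosstab(loans["IncomeRange"], loans["CreditRisk"], normalize="index")
print(shares)
```

If the shares of low risk borrowers rise with income while rates within a risk group stay similar, the income effect on the average rate comes from the mix of risk groups rather than from income itself.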

In [12]:

plt.figure(figsize = (14,7))
plt.suptitle('Segmentation by income range', size = 18)

plt.subplot(1,2,1)
ax = sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'IncomeRange',
                  palette = 'RdYlBu', linestyles = '--')
plt.xlabel('Loan origination year')
plt.ylabel('Average borrower APR')
legend = ax.legend()
legend.remove()
sb.despine()

plt.subplot(1,2,2)
sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount', hue = 'IncomeRange',
             palette = 'RdYlBu', linestyles = '--')
plt.xlabel('Loan origination year')
plt.ylabel('Average loan original amount ($)')
plt.legend(loc = 'right', bbox_to_anchor=(1.45, 0.5), ncol=1, title='Income Range', frameon=False)
sb.despine()

plt.show();

In [13]:

plt.figure(figsize = (14,7))
plt.suptitle('Income range and credit risk distribution', size = 18)

plt.subplot(1,2,1)
ax = sb.boxplot(data = df_samp, y = 'IncomeRange', x = 'BorrowerAPR', hue = 'CreditRisk', palette = 'RdYlBu_r')
ax.set(ylabel='Income range', xlabel='Borrower APR')
sb.despine()
legend = ax.legend()
legend.remove()

plt.subplot(1,2,2)
sb.countplot(data = df, y = 'IncomeRange', hue = 'CreditRisk', palette = 'RdYlBu_r')
plt.ylabel('')
plt.xlabel('Number of loans')
sb.despine();