Prosper Loan Data Analysis¶

Preliminary Wrangling¶

Prosper loan dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.

Prosper was founded in 2005 as the first peer-to-peer lending marketplace in the United States. Since then, Prosper has facilitated more than USD 12 billion in loans to more than 770,000 people.
Through Prosper, people can invest in each other in a way that is financially and socially rewarding. Borrowers apply online for a fixed-rate, fixed-term loan between USD 2,000 and USD 40,000. Individuals and institutions can invest in the loans and earn attractive returns. Prosper handles all loan servicing on behalf of the matched borrowers and investors. (source: https://www.prosper.com/about)

# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
from pywaffle import Waffle
import plotly.express as px

pd.set_option('display.max_colwidth', None) # to display the whole strings and not collapse them
pd.set_option('display.max_columns', None) # to not collapse columns when viewing data
pd.set_option('display.max_rows', None) # to display all rows in the output

import warnings
warnings.filterwarnings("ignore") # warnings off

df_full = pd.read_csv('prosperLoanData.csv')
df_full.head()

df_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss                        84853 non-null   float64
 12  EstimatedReturn                      84853 non-null   float64
 13  ProsperRating (numeric)              84853 non-null   float64
 14  ProsperRating (Alpha)                84853 non-null   object 
 15  ProsperScore                         84853 non-null   float64
 16  ListingCategory (numeric)            113937 non-null  int64  
 17  BorrowerState                        108422 non-null  object 
 18  Occupation                           110349 non-null  object 
 19  EmploymentStatus                     111682 non-null  object 
 20  EmploymentStatusDuration             106312 non-null  float64
 21  IsBorrowerHomeowner                  113937 non-null  bool   
 22  CurrentlyInGroup                     113937 non-null  bool   
 23  GroupKey                             13341 non-null   object 
 24  DateCreditPulled                     113937 non-null  object 
 25  CreditScoreRangeLower                113346 non-null  float64
 26  CreditScoreRangeUpper                113346 non-null  float64
 27  FirstRecordedCreditLine              113240 non-null  object 
 28  CurrentCreditLines                   106333 non-null  float64
 29  OpenCreditLines                      106333 non-null  float64
 30  TotalCreditLinespast7years           113240 non-null  float64
 31  OpenRevolvingAccounts                113937 non-null  int64  
 32  OpenRevolvingMonthlyPayment          113937 non-null  float64
 33  InquiriesLast6Months                 113240 non-null  float64
 34  TotalInquiries                       112778 non-null  float64
 35  CurrentDelinquencies                 113240 non-null  float64
 36  AmountDelinquent                     106315 non-null  float64
 37  DelinquenciesLast7Years              112947 non-null  float64
 38  PublicRecordsLast10Years             113240 non-null  float64
 39  PublicRecordsLast12Months            106333 non-null  float64
 40  RevolvingCreditBalance               106333 non-null  float64
 41  BankcardUtilization                  106333 non-null  float64
 42  AvailableBankcardCredit              106393 non-null  float64
 43  TotalTrades                          106393 non-null  float64
 44  TradesNeverDelinquent (percentage)   106393 non-null  float64
 45  TradesOpenedLast6Months              106393 non-null  float64
 46  DebtToIncomeRatio                    105383 non-null  float64
 47  IncomeRange                          113937 non-null  object 
 48  IncomeVerifiable                     113937 non-null  bool   
 49  StatedMonthlyIncome                  113937 non-null  float64
 50  LoanKey                              113937 non-null  object 
 51  TotalProsperLoans                    22085 non-null   float64
 52  TotalProsperPaymentsBilled           22085 non-null   float64
 53  OnTimeProsperPayments                22085 non-null   float64
 54  ProsperPaymentsLessThanOneMonthLate  22085 non-null   float64
 55  ProsperPaymentsOneMonthPlusLate      22085 non-null   float64
 56  ProsperPrincipalBorrowed             22085 non-null   float64
 57  ProsperPrincipalOutstanding          22085 non-null   float64
 58  ScorexChangeAtTimeOfListing          18928 non-null   float64
 59  LoanCurrentDaysDelinquent            113937 non-null  int64  
 60  LoanFirstDefaultedCycleNumber        16952 non-null   float64
 61  LoanMonthsSinceOrigination           113937 non-null  int64  
 62  LoanNumber                           113937 non-null  int64  
 63  LoanOriginalAmount                   113937 non-null  int64  
 64  LoanOriginationDate                  113937 non-null  object 
 65  LoanOriginationQuarter               113937 non-null  object 
 66  MemberKey                            113937 non-null  object 
 67  MonthlyLoanPayment                   113937 non-null  float64
 68  LP_CustomerPayments                  113937 non-null  float64
 69  LP_CustomerPrincipalPayments         113937 non-null  float64
 70  LP_InterestandFees                   113937 non-null  float64
 71  LP_ServiceFees                       113937 non-null  float64
 72  LP_CollectionFees                    113937 non-null  float64
 73  LP_GrossPrincipalLoss                113937 non-null  float64
 74  LP_NetPrincipalLoss                  113937 non-null  float64
 75  LP_NonPrincipalRecoverypayments      113937 non-null  float64
 76  PercentFunded                        113937 non-null  float64
 77  Recommendations                      113937 non-null  int64  
 78  InvestmentFromFriendsCount           113937 non-null  int64  
 79  InvestmentFromFriendsAmount          113937 non-null  float64
 80  Investors                            113937 non-null  int64  
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB

Using 81 variables for the analysis is not realistic, we will select around 15 of the most interesting ones.
We will use the data dictionary to assist with the selection of relevant variables:

Variable	Description
ListingKey	Unique key for each listing, same value as the 'key' used in the listing object in the API.
ListingNumber	The number that uniquely identifies the listing to the public as displayed on the website.
ListingCreationDate	The date the listing was created.
CreditGrade	The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.
Term	The length of the loan expressed in months.
LoanStatus	The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
ClosedDate	Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses.
BorrowerAPR	The Borrower's Annual Percentage Rate (APR) for the loan.
BorrowerRate	The Borrower's interest rate for this loan.
LenderYield	The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee.
EstimatedEffectiveYield	Effective yield is equal to the borrower interest rate (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. Applicable for loans originated after July 2009.
EstimatedLoss	Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009.
EstimatedReturn	The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009.
ProsperRating (numeric)	The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.
ProsperRating (Alpha)	The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.
ProsperScore	A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.
ListingCategory	The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
BorrowerState	The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
Occupation	The Occupation selected by the Borrower at the time they created the listing.
EmploymentStatus	The employment status of the borrower at the time they posted the listing.
EmploymentStatusDuration	The length in months of the employment status at the time the listing was created.
IsBorrowerHomeowner	A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
CurrentlyInGroup	Specifies whether or not the Borrower was in a group at the time the listing was created.
GroupKey	The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation.
DateCreditPulled	The date the credit profile was pulled.
CreditScoreRangeLower	The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
CreditScoreRangeUpper	The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
FirstRecordedCreditLine	The date the first credit line was opened.
CurrentCreditLines	Number of current credit lines at the time the credit profile was pulled.
OpenCreditLines	Number of open credit lines at the time the credit profile was pulled.
TotalCreditLinespast7years	Number of credit lines in the past seven years at the time the credit profile was pulled.
OpenRevolvingAccounts	Number of open revolving accounts at the time the credit profile was pulled.
OpenRevolvingMonthlyPayment	Monthly payment on revolving accounts at the time the credit profile was pulled.
InquiriesLast6Months	Number of inquiries in the past six months at the time the credit profile was pulled.
TotalInquiries	Total number of inquiries at the time the credit profile was pulled.
CurrentDelinquencies	Number of accounts delinquent at the time the credit profile was pulled.
AmountDelinquent	Dollars delinquent at the time the credit profile was pulled.
DelinquenciesLast7Years	Number of delinquencies in the past 7 years at the time the credit profile was pulled.
PublicRecordsLast10Years	Number of public records in the past 10 years at the time the credit profile was pulled.
PublicRecordsLast12Months	Number of public records in the past 12 months at the time the credit profile was pulled.
RevolvingCreditBalance	Dollars of revolving credit at the time the credit profile was pulled.
BankcardUtilization	The percentage of available revolving credit that is utilized at the time the credit profile was pulled.
AvailableBankcardCredit	The total available credit via bank card at the time the credit profile was pulled.
TotalTrades	Number of trade lines ever opened at the time the credit profile was pulled.
TradesNeverDelinquent	Number of trades that have never been delinquent at the time the credit profile was pulled.
TradesOpenedLast6Months	Number of trades opened in the last 6 months at the time the credit profile was pulled.
DebtToIncomeRatio	The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
IncomeRange	The income range of the borrower at the time the listing was created.
IncomeVerifiable	The borrower indicated they have the required documentation to support their income.
StatedMonthlyIncome	The monthly income the borrower stated at the time the listing was created.
LoanKey	Unique key for each loan. This is the same key that is used in the API.
TotalProsperLoans	Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans.
TotalProsperPaymentsBilled	Number of on time payments the borrower made on Prosper loans at the time they created this listing. This value will be null if the borrower had no prior loans.
OnTimeProsperPayments	Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans.
ProsperPaymentsLessThanOneMonthLate	Number of payments the borrower made on Prosper loans that were less than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
ProsperPaymentsOneMonthPlusLate	Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
ProsperPrincipalBorrowed	Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
ProsperPrincipalOutstanding	Principal outstanding on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
ScorexChangeAtTimeOfListing	Borrower's credit score change at the time the credit profile was pulled. This will be the change relative to the borrower's last Prosper loan. This value will be null if the borrower had no prior loans.
LoanCurrentDaysDelinquent	The number of days delinquent.
LoanFirstDefaultedCycleNumber	The cycle the loan was charged off. If the loan has not charged off the value will be null.
LoanMonthsSinceOrigination	Number of months since the loan originated.
LoanNumber	Unique numeric value associated with the loan.
LoanOriginalAmount	The origination amount of the loan.
LoanOriginationDate	The date the loan was originated.
LoanOriginationQuarter	The quarter in which the loan was originated.
MemberKey	The unique key that is associated with the borrower. This is the same identifier that is used in the API member object.
MonthlyLoanPayment	The scheduled monthly loan payment.
LP_CustomerPayments	Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
LP_CustomerPrincipalPayments	Pre charge-off cumulative principal payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
LP_InterestandFees	Pre charge-off cumulative interest and fees paid by the borrower. If the loan has charged off, this value will exclude any recoveries.
LP_ServiceFees	Cumulative service fees paid by the investors who have invested in the loan.
LP_CollectionFees	Cumulative collection fees paid by the investors who have invested in the loan.
LP_GrossPrincipalLoss	The gross charged off amount of the loan.
LP_NetPrincipalLoss	The principal that remains uncollected after any recoveries.
LP_NonPrincipalRecoverypayments	The interest and fee component of any recovery payments. The current payment policy applies payments in the following order: Fees, interest, principal.
PercentFunded	Percent the listing was funded.
Recommendations	Number of recommendations the borrower had at the time the listing was created.
InvestmentFromFriendsCount	Number of friends that made an investment in the loan.
InvestmentFromFriendsAmount	Dollar amount of investments that were made by friends.
Investors	The number of investors that funded the loan.

Dealing with duplicates¶

# there seem to be multiple loan identifiers
df_full.ListingKey.nunique()

113066

df_full.ListingNumber.nunique()

113066

df_full.LoanKey.nunique()

113066

We can choose any of the IDs to be the index of our dataset since they all seem to represent the loan ID.
However, we have a slightly lower number of unique keys than data points.
Checking the above unique identifiers suggests there are some duplicates in the data, let's clean these first:

df_full[df_full.LoanKey.duplicated() == True].count()['LoanKey']

871

df_full[df_full.LoanKey.duplicated() == True]['LoanKey'].head(5)

9       02163700809231365A56A1C
999     CB643706582969191E8C26B
2539    DC303708663304668E74587
4942    8D84369399064302947D8B5
5812    129C369881799628360C904
Name: LoanKey, dtype: object

# different in ProsperScore, where the value also should not be > 10 according to the dictionary and here we have 11
df_full[df_full.LoanKey == '02163700809231365A56A1C']

# the same issue as above
df_full[df_full.LoanKey == 'CB643706582969191E8C26B']

df_full.duplicated().sum()

0

# drop the problematic ProsperScore column
df_full.drop('ProsperScore', axis=1, inplace=True)

df_full.duplicated().sum()

871

# drop duplicate rows
df_full.drop_duplicates(inplace=True)
df_full[df_full.LoanKey.duplicated() == True].count()['LoanKey']

0

df_full.LoanKey.nunique()

113066

# all loans are now included only once
df_full.shape

(113066, 80)

Variables selection and basic data cleaning¶

The main loan and borrower indicators were selected for further analysis based on the data dictionary.
The full selected list can be seen below:

# this variable subset will be a starting point for further cleaning and investigation of relevant data
df = df_full[['LoanKey', 'LoanOriginationDate', 'ClosedDate', 'LoanStatus', 'Term', 'LoanOriginalAmount', 'MonthlyLoanPayment',\
              'ListingCategory (numeric)', 'BorrowerAPR', 'BorrowerRate', \
              'CreditGrade', 'ProsperRating (Alpha)', 'CreditScoreRangeLower', 'CreditScoreRangeUpper',\
              'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'DebtToIncomeRatio', 'EmploymentStatus',\
              'IsBorrowerHomeowner']]
df.head()

# data type changes follow
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 113066 entries, 0 to 113936
Data columns (total 20 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   LoanKey                    113066 non-null  object 
 1   LoanOriginationDate        113066 non-null  object 
 2   ClosedDate                 55076 non-null   object 
 3   LoanStatus                 113066 non-null  object 
 4   Term                       113066 non-null  int64  
 5   LoanOriginalAmount         113066 non-null  int64  
 6   MonthlyLoanPayment         113066 non-null  float64
 7   ListingCategory (numeric)  113066 non-null  int64  
 8   BorrowerAPR                113041 non-null  float64
 9   BorrowerRate               113066 non-null  float64
 10  CreditGrade                28953 non-null   object 
 11  ProsperRating (Alpha)      83982 non-null   object 
 12  CreditScoreRangeLower      112475 non-null  float64
 13  CreditScoreRangeUpper      112475 non-null  float64
 14  IncomeRange                113066 non-null  object 
 15  IncomeVerifiable           113066 non-null  bool   
 16  StatedMonthlyIncome        113066 non-null  float64
 17  DebtToIncomeRatio          104594 non-null  float64
 18  EmploymentStatus           110811 non-null  object 
 19  IsBorrowerHomeowner        113066 non-null  bool   
dtypes: bool(2), float64(7), int64(3), object(8)
memory usage: 16.6+ MB

# rename columns with spaces
df.rename(columns={'ListingCategory (numeric)': 'ListingCategory', 'ProsperRating (Alpha)': 'ProsperRating'}, inplace=True)

# LoanKey as index
df.set_index('LoanKey', inplace=True)

# date columns as dates
df.LoanOriginationDate = pd.to_datetime(df.LoanOriginationDate)
df.ClosedDate = pd.to_datetime(df.ClosedDate)

df.IncomeRange.value_counts()

$25,000-49,999    31940
$50,000-74,999    30749
$100,000+         17188
$75,000-99,999    16780
Not displayed      7741
$1-24,999          7241
Not employed        806
$0                  621
Name: IncomeRange, dtype: int64

# reduce no. of IncomeRange categories by 1
df.IncomeRange.replace({'$0': 'Not employed'}, inplace=True)

# CreditGrade, ProsperRating, IncomeRange as ordered categorical vars
var_dict = {'CreditGrade': ['AA', 'A', 'B', 'C', 'D', 'E', 'HR', 'NC'],
           'ProsperRating': ['AA', 'A', 'B', 'C', 'D', 'E', 'HR', 'NC'],
           'IncomeRange': ['Not displayed', 'Not employed', '$1-24,999', 
                           '$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+']}
for var in var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered=True, categories = var_dict[var])
    df[var] = df[var].astype(ordered_var)

df.describe()

What is the structure of your dataset?¶

We are dealing with a loan database of over 113k loans with 81 variables. To be able to do a meaningful analysis, we had to select a subset of these variables.

What is/are the main feature(s) of interest in your dataset?¶

The interest rate of the loan, its amount, and the credit rating of the borrower are the main features of interest.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

The supporting features of interest are the loan characteristics on the one hand, such as the loan amount, the date when the loan was taken, the term of the loan, the status of the loan, and some borrower characteristics on the other hand, such as the income, the employment and housing situation and debt to income ration of the borrower.

Univariate Exploration¶

df.head(3)

df.shape

(113066, 19)

# all histograms in one place for a quick overview

# some vars are very skewed and will require transformations

df.hist(['Term', 'LoanOriginalAmount', 'MonthlyLoanPayment', 'ListingCategory', 'BorrowerAPR', 'BorrowerRate',\
         'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'StatedMonthlyIncome', 'DebtToIncomeRatio'],\
        figsize=(10,10));

We will take look at all selected variables one by one to learn more details about their distributions.

LoanOriginationDate¶

df.LoanOriginationDate.describe()

count                  113066
unique                   1873
top       2013-11-13 00:00:00
freq                      469
first     2005-11-15 00:00:00
last      2014-03-12 00:00:00
Name: LoanOriginationDate, dtype: object

df[df.LoanOriginationDate < '2009-01-01 00:00:00'].count()['LoanOriginationDate']

28940

Dataset includes loans from period 2005-2014.
28,940 loans are from the period 2005-2008.

# 2009 was a critical year for Prosper
df.groupby(df.LoanOriginationDate.dt.year).count()['LoanOriginationDate'].plot(kind='bar')
plt.ylabel('Number of loans')
plt.xticks(rotation=0)
plt.tight_layout();

df['LoanYear'] = df['LoanOriginationDate'].dt.year
df.LoanYear.dtype

dtype('int64')

# let's look at the tails: 2005 and 2014 to know why there are fewer loans on those years
df[df.LoanOriginationDate.dt.year == 2005].count()['LoanOriginationDate']

22

# our data ends in March 2014
df.LoanOriginationDate.max()

Timestamp('2014-03-12 00:00:00')

Prosper started business in 2005, which is also the oldest loans which we have in the data set. However, there are only 22 loans from 2005.
Prosper stopped their business between Nov 2008 - July 2009 due to a cease order by SEC, which is very much reflected in the low number of loans in 2009.
Borrowers may come from 47 US states, bidders can come only from 28 US states since the relaunch. (source: https://en.wikipedia.org/wiki/Prosper_Marketplace)
Since 2010, the number of new loans increases steadily, with a record number of new loans in 2013.
We only have two months of full data for 2014.

ClosedDate¶

# 55076 loans are already closed
df.ClosedDate.notnull().sum()

55076

df.groupby(df.ClosedDate.dt.year).count()['ClosedDate'].plot(kind='bar')
plt.ylabel('Number of loans')
plt.xticks(rotation=0)
plt.tight_layout();

# months between loan origination and closing date
df['MonthsToClosed'] = ((df.ClosedDate - df.LoanOriginationDate) / np.timedelta64(1, 'M'))

df['MonthsToClosed'].describe()

count    55076.000000
mean        19.291197
std         11.615265
min        -45.536869
25%          9.232222
50%         17.051685
75%         31.015011
max         51.155055
Name: MonthsToClosed, dtype: float64

df[df['MonthsToClosed'] < 0]

There are 2 loans with closed date prior to loan origination date, which we will delete:

df.drop(df[df['MonthsToClosed'] < 0].index, inplace=True)
df.shape

(113064, 21)

bins = np.arange(0, df['MonthsToClosed'].max()+2, 2)
plt.hist(data = df, x = 'MonthsToClosed', bins = bins)
plt.title('Distribution of number of months to reach closed status');

df[['LoanOriginationDate', 'ClosedDate', 'MonthsToClosed', 'Term', 'LoanStatus']].head()

There are 55k of closed loans. Generated column MonthsToClosed looks at number of months between loan origination date and closed date.
The distribution of MonthsToClosed is bimodal with one peak around 12 months and another peak around 36 months.

LoanStatus¶

df.LoanStatus.value_counts()

Current                   55730
Completed                 38061
Chargedoff                11992
Defaulted                  5017
Past Due (1-15 days)        800
Past Due (31-60 days)       361
Past Due (61-90 days)       311
Past Due (91-120 days)      304
Past Due (16-30 days)       265
FinalPaymentInProgress      203
Past Due (>120 days)         16
Cancelled                     4
Name: LoanStatus, dtype: int64

It makes sense to divide loan status further into active and non-active loans because these are two distinct groups.
ClosedDate can used for this.

df.loc[df['ClosedDate'].isnull(), 'LoanStatusActive'] = 1
df.loc[df['ClosedDate'].notnull(), 'LoanStatusActive'] = 0
df.LoanStatusActive.value_counts()

1.0    57990
0.0    55074
Name: LoanStatusActive, dtype: int64

df.LoanStatusActive = df['LoanStatusActive'].astype('bool')
df.LoanStatusActive.dtype

dtype('bool')

plt.figure(figsize=(15,5))
plt.subplot(1,2,1)

# set base color for plots
base_color = sb.color_palette()[0]

sb.countplot(data = df[df.LoanStatusActive == 1], y = 'LoanStatus', color = base_color, orient = 'h')
plt.xticks(rotation = 0)
plt.title('Active loans')

plt.subplot(1,2,2)
order = ['FinalPaymentInProgress', 'Past Due (1-15 days)', 'Past Due (16-30 days)', \
         'Past Due (31-60 days)', 'Past Due (61-90 days)', 'Past Due (91-120 days)', 'Past Due (>120 days)']
sb.countplot(data = df[df.LoanStatus.isin(order)], y = 'LoanStatus', order = order, color = base_color, orient = 'h')
plt.xticks(rotation = 0)
plt.title('Active loans (current status excluded)')
plt.tight_layout();

Among active loans, the vast majority are loans without any issues, i.e. in the "current" status.
Among active loans not in the current status, the highest amount is in default for 1-15 days.

plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sb.countplot(data = df[df.LoanStatusActive == 0], y = 'LoanStatus', color = base_color)
plt.xticks(rotation = 0)
plt.title('Non-Active loans')

plt.subplot(1,2,2)
sb.countplot(data = df[df.LoanStatus.isin(['Defaulted', 'Chargedoff', 'Cancelled'])], y = 'LoanStatus', \
             color = base_color)
plt.xticks(rotation = 0)
plt.title('Non-Active loans (completed status excluded)')
plt.tight_layout();

Most non-active loans are in "completed" status, i.e. fully repaid.
In the not-repaid group of non-active loans, there are more charged off loans (i.e. those were the chance of repayment is minimal) than defaulted ones.

df[df.LoanStatus == 'Chargedoff'].count()['LoanStatus'] / len(df)

0.10606382225995896

Charged off loans make 10.6% of all loans.

# make status an ordered categorical var
ordered_var = pd.api.types.CategoricalDtype(ordered=True, categories = ['Current', 'FinalPaymentInProgress',
                                                                        'Past Due (1-15 days)', 'Past Due (16-30 days)', 
                                                                        'Past Due (31-60 days)', 'Past Due (61-90 days)', 
                                                                        'Past Due (91-120 days)', 'Past Due (>120 days)',
                                                                        'Completed', 'Defaulted', 'Chargedoff', 'Cancelled'])
df['LoanStatus'] = df['LoanStatus'].astype(ordered_var)

Term¶

sb.countplot(data = df, x = 'Term', palette = 'pastel');

data = {'12': df[df['Term'] == 12].count()['Term'], 
        '36': df[df['Term'] == 36].count()['Term'], '60': df[df['Term'] == 60].count()['Term']}
fig = plt.figure(
    FigureClass=Waffle, 
    rows=9, 
    columns=9, 
    values=data,
    figsize=(4, 4),
    legend={'loc': 'upper left', 'bbox_to_anchor': (1.1, 1)},
    colors=("lightblue", "peachpuff", "lightgreen")
)
plt.title('Term distribution')
plt.show()

There are only 3 distinct loan terms:

1 year (12 months)
3 years (36 months)
5 years (60 months)

The vast majority of loans is for 36 months.

LoanOriginalAmount¶

plt.figure(figsize=(13,4))
plt.suptitle('Loan original amount distribution')
plt.subplot(1,2,1)
bins = np.arange(0, df['LoanOriginalAmount'].max()+900, 900)
plt.hist(data = df, x = 'LoanOriginalAmount', bins = bins)

plt.subplot(1,2,2)
sb.distplot(df['LoanOriginalAmount']);

np.sort(df.LoanOriginalAmount.value_counts().nlargest(8).index)

array([ 2000,  3000,  4000,  5000, 10000, 15000, 20000, 25000],
      dtype=int64)

df.LoanOriginalAmount.value_counts().nlargest(8)

4000     14207
15000    12232
10000    10956
5000      6953
2000      6042
3000      5728
25000     3588
20000     3234
Name: LoanOriginalAmount, dtype: int64

Loan amounts are right skewed with several peaks as shown in the histogram.
The most frequently loaned amount is 4000.

MonthlyLoanPayment¶

bins = np.arange(0, df['MonthlyLoanPayment'].max()+100, 100)
plt.hist(data = df, x = 'MonthlyLoanPayment', bins = bins)
plt.xlabel('Monthly loan payment')
plt.ylabel('Number of loans');

df.MonthlyLoanPayment.mode()

0    173.71
dtype: float64

df.MonthlyLoanPayment.value_counts().nlargest(5)

173.71    2423
0.00       935
172.76     530
86.85      472
174.20     460
Name: MonthlyLoanPayment, dtype: int64

df.MonthlyLoanPayment.describe()

count    113064.000000
mean        271.936536
std         192.549541
min           0.000000
25%         130.950000
50%         217.370000
75%         370.570000
max        2251.510000
Name: MonthlyLoanPayment, dtype: float64

Monthly payment distribution is right skewed and has a long tail.
Let's look at its log transformed distribution:

log_binsize = 0.02
bins = 10 ** np.arange(1.3, np.log10(df['MonthlyLoanPayment'].max())+log_binsize, log_binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = df, x = 'MonthlyLoanPayment', bins = bins)
plt.xscale('log')
plt.xlabel('Monthly loan payment')
plt.ylabel('Number of loans')
plt.xticks([90, 173, 400, 1e3, 2e3], [90, 173, 400, '1k', '2k'])
plt.show()

Now we can see the distribution more clearly. Many loans have very low loan payments and the mode of 173 is clearly visible.

ListingCategory¶

# keep ListingCategory as int to easily keep the same order in all plots and have legible axis labels
sb.countplot(data = df, x = 'ListingCategory', color = base_color);

The dictionary for listing categories is the following:

Not Available
Debt Consolidation
Home Improvement
Business
Personal Loan
Student Use
Auto
Other
Baby&Adoption
Boat
Cosmetic Procedure
Engagement Ring
Green Loans
Household Expenses
Large Purchases
Medical/Dental
Motorcycle
RV
Taxes
Vacation
Wedding Loans

By far the most loans are debt consolidation and many loans do not have a category defined.
The distribution shows that ListingCategory might not be a very interesting variable for further analysis after all and if, then only in the top 5 categories. Below top 5 categories, there is a very small loan sample.

BorrowerAPR and BorrowerRate¶

plt.figure(figsize=(13,4))
plt.suptitle("Borrower's Annual Percentage Rate distribution")
plt.subplot(1,2,1)
bins = np.arange(0, df.BorrowerAPR.max()+0.02, 0.02)
plt.hist(data = df, x = 'BorrowerAPR', bins = bins)
plt.subplot(1,2,2)
sb.distplot(df['BorrowerAPR']);

plt.figure(figsize=(13,4))
plt.suptitle("Borrower's interest rate distribution")
plt.subplot(1,2,1)
bins = np.arange(0, df.BorrowerRate.max()+0.02, 0.02)
plt.hist(data = df, x = 'BorrowerRate', bins = bins)
plt.subplot(1,2,2)
sb.distplot(df['BorrowerRate']);

plt.figure(figsize=(15,4))
plt.subplot(1,2,1)
g = sb.distplot(df['BorrowerAPR'])
plt.subplot(1,2,2)
sb.distplot(df['BorrowerRate'])
plt.ylim(g.get_ylim());

df[['BorrowerAPR', 'BorrowerRate']].describe()

BorrowerAPR and BorrowerRate have very similar distributions.
APR rate is somewhat larger because it includes fees.

# the most common rates are very high, as is also seen by the peak in the histogram
df['BorrowerAPR'].value_counts().nlargest(5)

0.35797    3672
0.35643    1644
0.37453    1260
0.30532     902
0.29510     747
Name: BorrowerAPR, dtype: int64

It is somewhat suprising that 25% of loans have APR larger than 28%, which seems very high.
However, looking at the very extreme end of APR 0.4, there are 68 fully repaid loans and just 34 charged-off or defaulted in this category:

df[(df['BorrowerAPR']>= 0.4) & (df['LoanStatus'] == 'Completed')].count()['Term']

68

df[(df['BorrowerAPR']>= 0.4) & ((df['LoanStatus'] == 'Defaulted') | (df['LoanStatus'] == 'Chargedoff'))].count()['Term']

34

BorrowerAPR and BorrowerRate contain the same information, let's keep just one of them and drop the other one:

df.drop('BorrowerRate', axis = 1, inplace = True)

CreditGrade and ProsperRating¶

These two columns belong together because they are mutually separated by age of loans. We can merge these two columns to get a rating for all loans.
To prove this point and look at the final rating distribution, we have to do some bivariate plotting already at this point.
Further, we will create another additional column which will group the ratings into risk categories.

plt.figure(figsize=(13,4))
plt.suptitle("Credit rating variables distribution")
plt.subplot(1,2,1)
sb.countplot(data = df, x = 'CreditGrade', palette = 'RdYlBu_r')
plt.subplot(1,2,2)
sb.countplot(data = df, x = 'ProsperRating', palette = 'RdYlBu_r');

There are fewer loans with CreditGrade because it was only used until 2009 and since then, ProsperRating is used:

plt.figure(figsize=(13,4))
plt.suptitle("Credit ratings amounts by year")
plt.subplot(1,2,1)
df.groupby(df.LoanOriginationDate.dt.year).count()['CreditGrade'].plot(kind='bar')
plt.title('CreditGrade')
plt.xticks(rotation = 0)

plt.subplot(1,2,2)
df.groupby(df.LoanOriginationDate.dt.year).count()['ProsperRating'].plot(kind='bar')
plt.title('ProsperRating')
plt.xticks(rotation = 0)
plt.tight_layout();

# create new variable CreditRating which combines the two rating columns
df['CreditRating'] = np.where(df.CreditGrade.isnull(), df.ProsperRating, df.CreditGrade)

# test for correctness
df[['CreditGrade', 'ProsperRating', 'CreditRating']].sample(15)

# make the new column CreditRating an ordinal categorical variable
df.CreditRating = df['CreditRating'].astype(pd.api.types.CategoricalDtype(ordered=True, categories = var_dict['CreditGrade']))

sb.countplot(data = df, x = 'CreditRating', palette = 'RdYlBu_r')
plt.xlabel('Credit Rating')
plt.ylabel('Number of loans');

# 131 loans are missing a rating
df.CreditRating.isnull().sum()

131

# because they did not have a rating in either CreditGrade or ProsperRating columns
df[df.CreditRating.isnull()][['LoanOriginationDate', 'CreditGrade', 'ProsperRating', 'CreditRating']].sample(5)

# these loans are from 2009-2010
df[df.CreditRating.isnull()][['LoanOriginationDate', 'CreditGrade', 'ProsperRating']].describe()

# the other two ratings columns will not be needed anymore
df.drop(['CreditGrade', 'ProsperRating'], axis=1, inplace=True)

It will also be useful to create another rating variable, which will group scores into a few riskiness categories.

# create new variable CreditRating which combines the two rating columns
df['CreditRisk'] = np.where(df.CreditRating.isin(['AA', 'A']), 'low',
                           np.where(df.CreditRating.isin(['B', 'C', 'D']), 'medium',
                                    np.where(df.CreditRating.isin(['E', 'HR', 'NC']), 'high', None)))

# test
df[['CreditRating', 'CreditRisk']].sample(5)

# test on NaN
df[df.CreditRating.isnull()][['CreditRating', 'CreditRisk']].sample(5)

df.CreditRisk.isnull().sum()

131

# CreditRisk as ordered categorical var
ordered_var = pd.api.types.CategoricalDtype(ordered=True, categories = ['low', 'medium', 'high'])
df['CreditRisk'] = df['CreditRisk'].astype(ordered_var)

sb.countplot(data = df, x = 'CreditRisk', palette = 'RdYlBu_r')
plt.xlabel('Credit Risk')
plt.ylabel('Number of loans');

There are similar amounts of low and high risk loans with the majority if all loans being of medium risk.

CreditScoreRangeLower and CreditScoreRangeUpper¶

We will look at these two variables together to compare the gap between them.

plt.figure(figsize=(13,4))
plt.suptitle("Credit Score Range")
plt.subplot(1,2,1)
bins_lower = np.arange(0, df.CreditScoreRangeLower.max()+20, 20)
plt.hist(data = df, x = 'CreditScoreRangeLower', bins = bins_lower)
plt.xlabel('CreditScoreRangeLower')
plt.subplot(1,2,2)
bins_upper = np.arange(0, df.CreditScoreRangeUpper.max()+20, 20)
plt.hist(data = df, x = 'CreditScoreRangeUpper', bins = bins_upper)
plt.xlabel('CreditScoreRangeUpper');

df[['CreditScoreRangeLower', 'CreditScoreRangeUpper']].describe()

The two distributions are very similar and this is because the range buckets are not large (they are of size 20) and here are all the existing buckets listed:

df.groupby(['CreditScoreRangeLower', 'CreditScoreRangeUpper'])['LoanOriginationDate'].count()

CreditScoreRangeLower  CreditScoreRangeUpper
0.0                    19.0                       133
360.0                  379.0                        1
420.0                  439.0                        5
440.0                  459.0                       36
460.0                  479.0                      141
480.0                  499.0                      346
500.0                  519.0                      554
520.0                  539.0                     1593
540.0                  559.0                     1474
560.0                  579.0                     1357
580.0                  599.0                     1125
600.0                  619.0                     3601
620.0                  639.0                     4172
640.0                  659.0                    12099
660.0                  679.0                    16176
680.0                  699.0                    16315
700.0                  719.0                    15315
720.0                  739.0                    12796
740.0                  759.0                     9211
760.0                  779.0                     6566
780.0                  799.0                     4607
800.0                  819.0                     2636
820.0                  839.0                     1408
840.0                  859.0                      567
860.0                  879.0                      212
880.0                  899.0                       27
Name: LoanOriginationDate, dtype: int64

All information will be kept even when we drop one of the columns and work just with the other one:

df.drop('CreditScoreRangeUpper', axis = 1, inplace = True)

IncomeRange¶

sb.countplot(data = df, y = 'IncomeRange', color = base_color)
plt.xticks(rotation = 0);

We can see two large groups in the loans: mid earners on the one hand (25k - 75k) making the majority of loans, and high earners group (75k+) on the other.

IncomeVerifiable¶

df.IncomeVerifiable.value_counts().plot(kind='pie', autopct='%1.0f%%');

Majority of loans have borrowers with a verifiable income.

StatedMonthlyIncome¶

# monthly income is very right skewed
bins = np.arange(0, 30500, 500)
plt.hist(data = df, x = 'StatedMonthlyIncome', bins = bins)
plt.xlabel('Stated monthly income')
plt.xlim((0,30000));

log_binsize = 0.07
bins = 10 ** np.arange(2.5, np.log10(df['StatedMonthlyIncome'].max())+log_binsize, log_binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = df, x = 'StatedMonthlyIncome', bins = bins)
plt.xscale('log')
plt.xticks([1000, 3000, 5000, 10000, 20000, 40000], ['1k', '3k', '5k', '10k', '20k', '40k'])
plt.xlim((0,80000))
plt.xlabel('Stated monthly income')
plt.show()

Given the very skewed distribution of stated monthly income, we transformed the scale. The distribution looks normal after the transformation.

df.StatedMonthlyIncome.describe()

count    1.130640e+05
mean     5.605142e+03
std      7.495656e+03
min      0.000000e+00
25%      3.199521e+03
50%      4.666667e+03
75%      6.825000e+03
max      1.750003e+06
Name: StatedMonthlyIncome, dtype: float64

Some borrowers have very high stated monthly income of over 90k.
These outliers might be errors, but we cannot be sure, so we don't delete them.

df[df.StatedMonthlyIncome > 90000][['IncomeVerifiable', 'CreditRisk', 'LoanStatus', 'EmploymentStatus', 'IncomeRange']].sample(5)

df.StatedMonthlyIncome.value_counts().nlargest(5)

4166.666667    3486
5000.000000    3367
3333.333333    2889
3750.000000    2399
5416.666667    2351
Name: StatedMonthlyIncome, dtype: int64

DebtToIncomeRatio¶

plt.figure(figsize=(13,4))
plt.suptitle('Debt to income ratio distribution')
plt.subplot(1,2,1)
bins = np.arange(0, 10.5, 0.5)
plt.hist(data = df, x = 'DebtToIncomeRatio', bins = bins)
plt.xlabel('Debt to Income ratio')

plt.subplot(1,2,2)
bins = np.arange(0, 2.05, 0.05)
plt.hist(data = df, x = 'DebtToIncomeRatio', bins = bins)
plt.xlim((0,2))
plt.xlabel('Debt to Income ratio');

df.DebtToIncomeRatio.describe()

count    104593.000000
mean          0.276032
std           0.553740
min           0.000000
25%           0.140000
50%           0.220000
75%           0.320000
max          10.010000
Name: DebtToIncomeRatio, dtype: float64

Debt to income ratio is very right skewed with bulk of values around 0.2. Let's transform the scale to look at the distribution in detail:

log_binsize = 0.05
bins = 10 ** np.arange(-1.5, np.log10(df['DebtToIncomeRatio'].max())+log_binsize, log_binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = df, x = 'DebtToIncomeRatio', bins = bins)
plt.xscale('log')
plt.xticks([0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2], [0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2])
plt.xlim((0,5))
plt.xlabel('Debt to Income ratio')
plt.ylabel('Number of loans')
plt.show()

EmploymentStatus¶

plt.figure(figsize=[9, 5])
status_counts = df['EmploymentStatus'].value_counts()
status_order = status_counts.index
max_count = status_counts[0]
max_prop = max_count / df.shape[0]
tick_props = np.arange(0, max_prop, 0.1)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]

g = sb.countplot(data = df, x = 'EmploymentStatus', color = base_color, order = status_order)
plt.xticks(rotation = 0)
plt.xlabel('Employment Status')
plt.ylabel('Number of loans')


# values must be ordered from the largest for this to work
for i in range(status_counts.shape[0]):
    count = status_counts[i]
    pct_string = '{:0.1f}%'.format(100 * count / df.shape[0])
    plt.text(i, count+10, pct_string, ha = 'center')

plt.tight_layout();

Majority of borrowers are employed.

IsBorrowerHomeowner¶

df.IsBorrowerHomeowner.value_counts().plot(kind='pie', autopct='%1.0f%%');

Exactly half of loans belongs to a borrower who is a home owner.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

I created a new credit rating variable. This was done by merging two existing rating variables which were mutually exclusive by the loan origination date. I further created a credit risk variable which categorizes the rating into three groups: low, medium, or high risk.
Loan original amount is multimodal, with the most popular loaned amounts being 4k, 15k and 10k.
From the two similar interest rates variables, I chose the borrower annual percentage rate.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

I deleted two loans which had their closed date earlier than the loan origination date.
I reduced the number of income ranges by one by merging USD 0 with Not Employed into the same category.
I also created an active loans dummy to be able to quickly filter all active or non-active loan statuses from the data.
I further looked at a couple of right skewed variables using a logarithmic scale to better see where the bulk of the data points lie.

Bivariate Exploration¶

numeric_vars = ['BorrowerAPR', 'LoanOriginalAmount', 'MonthlyLoanPayment', 'CreditScoreRangeLower',
                'StatedMonthlyIncome', 'DebtToIncomeRatio', 'MonthsToClosed', 'Term']
bool_vars = ['LoanStatusActive', 'IncomeVerifiable', 'IsBorrowerHomeowner']
ordered_categoric_vars = ['IncomeRange', 'CreditRisk', 'CreditRating']

Numeric variables¶

# pairplot of numeric vars to see relationships overview on a sample of data for faster rendering
sample = np.random.choice(df.shape[0], 500, replace = False)
df_samp = df.iloc[sample,:]
sb.pairplot(df_samp, vars = numeric_vars);

# correlations
plt.figure(figsize = [10, 7])
sb.heatmap(df[['BorrowerAPR', 'LoanOriginalAmount', 'MonthlyLoanPayment', 'CreditScoreRangeLower', 'StatedMonthlyIncome', \
              'DebtToIncomeRatio', 'MonthsToClosed', 'Term', 'LoanStatusActive', 'IncomeVerifiable', 'IsBorrowerHomeowner']]\
           .corr(), annot = True, fmt = '.2f',
           cmap = 'vlag_r', center = 0)
plt.show()

The dependent variables of our interest are BorrowerAPR, the borrower annual percentage rate, and LoanOriginalAmount, i.e. the borrowed amount.
Based on the correlations, we will look into the following relationships of dependent variables:

BorrowerAPR - LoanOriginalAmount
BorrowerAPR - MonthlyLoanPayment
BorrowerAPR - CreditScoreRangeLower
LoanOriginalAmount - MonthlyLoanPayment
LoanOriginalAmount - CreditScoreRangeLower

Further, we see that borrowers with a verifiable income tend to have a lower debt to income ratio and that active loans are on average for a longer period, higher amounts and higher monthly payments than non-active loans.

Relationships of interest:

# a sample of 5000 from the data
sample = np.random.choice(df.shape[0], 5000, replace = False)
df_samp = df.iloc[sample,:]

plt.figure(figsize = (20,10))
plt.subplot(2,2,1)
g = plt.scatter(data = df_samp, x = 'LoanOriginalAmount', y = 'BorrowerAPR', alpha = 1/10)
plt.xscale('log')
plt.xlabel('Loan Original Amount (log scale)')
plt.ylabel('Borrower APR')
plt.ylim((0,0.45))
plt.xticks([1000, 2000, 5000, 10000, 20000], [1000, 2000, 5000, 10000, 20000])
# display the correlation
ax = plt.gca()
coef = df[['LoanOriginalAmount', 'BorrowerAPR']].corr().iloc[1,0]
label = r'$\rho$ = ' + str(round(coef, 2))
ax.annotate(label, xy = (0.75, 0.9), size = 20, xycoords = ax.transAxes)

plt.subplot(2,2,2)
plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'BorrowerAPR', alpha = 1/10)
plt.xscale('log')
plt.xlabel('Monthly Loan Payment (log scale)')
plt.ylabel('Borrower APR')
plt.xticks([10, 30, 100, 300, 800], [10, 30, 100, 300, 800])
plt.xlim((0,2000))
plt.ylim((0,0.45))
# display the correlation
ax = plt.gca()
coef = df[['MonthlyLoanPayment', 'BorrowerAPR']].corr().iloc[1,0]
label = r'$\rho$ = ' + str(round(coef, 2))
ax.annotate(label, xy = (0.75, 0.9), size = 20, xycoords = ax.transAxes)

plt.subplot(2,2,3)
plt.scatter(data = df_samp, x = 'CreditScoreRangeLower', y = 'BorrowerAPR', alpha = 1/10)
plt.xlabel('Credit Score Range Lower')
plt.ylabel('Borrower APR')
plt.xlim((400, 900))
plt.ylim((0,0.45))
# display the correlation
ax = plt.gca()
coef = df[['CreditScoreRangeLower', 'BorrowerAPR']].corr().iloc[1,0]
label = r'$\rho$ = ' + str(round(coef, 2))
ax.annotate(label, xy = (0.75, 0.9), size = 20, xycoords = ax.transAxes)

plt.suptitle('Borrower APR strongest relationships', size = 25);

The negative correlations of BorrowerAPR and loan original amount, monthly loan payment and credit score range lower are somewhat visible from the scatter plots and they all make logical sense.

plt.figure(figsize = (15,5))
plt.subplot(1,2,1)
plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'LoanOriginalAmount', alpha = 1/10)
plt.ylabel('Loan Original Amount')
plt.xlabel('Monthly Loan Payment')
plt.xlim((0,1500))
# display the correlation
ax = plt.gca()
coef = df[['MonthlyLoanPayment', 'LoanOriginalAmount']].corr().iloc[1,0]
label = r'$\rho$ = ' + str(round(coef, 2))
ax.annotate(label, xy = (0.1, 0.85), size = 20, xycoords = ax.transAxes)

plt.subplot(1,2,2)
plt.scatter(data = df_samp, x = 'CreditScoreRangeLower', y = 'LoanOriginalAmount', alpha = 1/10)
plt.xlabel('Credit Score Range Lower')
plt.ylabel('Loan Original Amount')
plt.xlim((400,900))
# display the correlation
ax = plt.gca()
coef = df[['CreditScoreRangeLower', 'LoanOriginalAmount']].corr().iloc[1,0]
label = r'$\rho$ = ' + str(round(coef, 2))
ax.annotate(label, xy = (0.1, 0.85), size = 20, xycoords = ax.transAxes)
plt.suptitle('Loan Original Amount strongest relationships', size = 20);

The positive correlation of loan original amount with monthly loan payment and credit score range lower is clear from the plots and makes logical sense, too.
Displaying monthly payment on a logarithmic scale shows the quadratic relationship with loan original amount.

plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'LoanOriginalAmount', alpha = 1/10)
plt.ylabel('Loan original amount')
plt.xlabel('Monthly loan payment')
plt.xlim((20,2000))
plt.xscale('log')
plt.xticks([90, 173, 400, 1e3, 2e3], [90, 173, 400, '1k', '2k']);

Next, let's look at relationships of categorical variables with the dependent variables.

Categorical variables¶

# sample of 2000
samples = np.random.choice(df.shape[0], 1000, replace = False)
df_samp = df.iloc[samples,:]

def boxgrid(x, y, **kwargs):
    """ Function for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    sb.boxplot(x, y, color = default_color)

plt.figure(figsize = [15, 10])
g = sb.PairGrid(data = df_samp, x_vars = ['BorrowerAPR', 'LoanOriginalAmount'], y_vars = ordered_categoric_vars,
                size = 3, aspect = 1.5)
g.map(boxgrid)
plt.show();

<Figure size 1080x720 with 0 Axes>

plt.figure(figsize = (15,5))
plt.suptitle('Interest rate differentiation by credit risk')

plt.subplot(1,2,1)
# subplots do not work when using catplot+violin kind
sb.violinplot(data=df_samp, x="CreditRisk", y="BorrowerAPR", palette = 'RdYlBu_r')
sb.despine()

plt.subplot(1,2,2)
bins = np.arange(0.05, 0.45,0.01)
plt.hist(df_samp[df_samp['CreditRisk']=='low']['BorrowerAPR'], alpha = 0.5, color = sb.color_palette('RdYlBu_r')[0], bins = bins)
plt.hist(df_samp[df_samp['CreditRisk']=='medium']['BorrowerAPR'], alpha = 0.5, color = sb.color_palette('RdYlBu_r')[3], bins = bins)
plt.hist(df_samp[df_samp['CreditRisk']=='high']['BorrowerAPR'], alpha = 0.5, color = sb.color_palette('RdYlBu_r')[5], bins = bins)
sb.despine()
plt.xlabel('Borrower APR')
plt.ylabel('Number of loans')

plt.show();

plt.figure(figsize = [10, 20])
g = sb.PairGrid(data = df_samp, x_vars = ['BorrowerAPR', 'LoanOriginalAmount'], y_vars = ['EmploymentStatus', 'LoanStatus'],
                size = 4, aspect = 1.5)
g.map(boxgrid)
plt.show();

<Figure size 720x1440 with 0 Axes>

plt.figure(figsize = [10, 10])
g = sb.PairGrid(data = df_samp, y_vars = ['BorrowerAPR', 'LoanOriginalAmount'], x_vars = bool_vars,
                size = 3, aspect = 1.5)
g.map(boxgrid)
plt.show();

<Figure size 720x720 with 0 Axes>

g = sb.FacetGrid(data = df, col = 'IncomeVerifiable')
g.map(sb.distplot, 'BorrowerAPR');

Borrowers with a verifiable income benefit from lower rate.

g = sb.FacetGrid(data = df, col = 'IsBorrowerHomeowner')
g.map(sb.distplot, 'LoanOriginalAmount');

Borrowers who own home get higher loan amounts.

plt.figure(figsize = [10, 10])
g = sb.PairGrid(data = df_samp, y_vars = ['BorrowerAPR', 'LoanOriginalAmount'], x_vars = numeric_vars[-1:],
                size = 3, aspect = 1.5)
g.map(boxgrid)
plt.show()

<Figure size 720x720 with 0 Axes>

plt.figure(figsize = (15,5))
plt.subplot(1,2,1)
g1 = sb.violinplot(data=df_samp, x="Term", y="LoanOriginalAmount", inner='stick', palette = 'pastel')

plt.subplot(1,2,2)
g2 = sb.barplot(data = df, x = 'Term', y = 'LoanOriginalAmount', palette = 'pastel')
plt.ylabel('Avg loan original amount')
plt.show();

sb.catplot(x="Term", y="BorrowerAPR", kind="violin", inner='stick', data=df_samp, palette = 'pastel');

sb.countplot(data = df, x = 'IsBorrowerHomeowner', hue = 'CreditRisk', palette = 'RdYlBu_r');

There are more high risk borrowers among those who do not own a house.

# most loans are for 36m and the most common rating is C
sb.countplot(data = df, x = 'CreditRating', hue = 'Term', palette = 'pastel');

The above box plots reveal the following (none of which is actually too surprising and it is what we would expect from a reasonable lending strategy):

Interest rate decreases and borrowed amount increases with higher income of the borrower
The more risky the loan, the higher the interest rate and the lower the borrowed amount
Loans with a higher interest default more (active loans: past due statuses vs. current loans, non-active loans: defaulted/chargedoff vs. completed loans)
Active loans are for larger loan amounts than non-active loans
Home owners pay a lower interest and borrow higher amounts
Unemployed borrowers pay the highest interest
Employed borrowers borrow the highest amounts
Higher loan amounts are borrowed for longer term

Let's very briefly also look at other numeric vars than BorrowerAPR and LoanOriginalAmount:

plt.figure(figsize = [10, 10])
g = sb.PairGrid(data = df_samp, y_vars = ['DebtToIncomeRatio', 'MonthlyLoanPayment'], x_vars = ['CreditRisk', 'Term'],
                size = 3, aspect = 1.5)
g.map(boxgrid)
plt.show()

<Figure size 720x720 with 0 Axes>

The squeezed debt to income ration box plots tell us that there are many outliers in the data with very high ratios.
High risk borrowers have lower monthly payments compared to other risk categories.

Other categorical variables¶

plt.figure(figsize = [10, 15])
g = sb.PairGrid(data = df_samp, y_vars = ['BorrowerAPR', 'LoanOriginalAmount'], x_vars = 'ListingCategory',
                size = 5, aspect = 1.5)
g.map(boxgrid)
plt.xticks(rotation = 0)
plt.show();

<Figure size 720x1080 with 0 Axes>

Many of the listing category groups are small, hence the long whiskers.

plt.figure(figsize = [10,12])

plt.subplot(3,1,1)
sb.countplot(data = df, x = 'EmploymentStatus', hue = 'CreditRisk', palette = 'RdYlBu_r')

plt.subplot(3,1,2)
sb.countplot(data = df, x = 'EmploymentStatus', hue = 'Term', palette = 'pastel')

plt.subplot(3,1,3)
sb.countplot(data = df, x = 'EmploymentStatus', hue = 'IsBorrowerHomeowner', palette = 'colorblind')

plt.tight_layout();

employed borrowers get loans with term 60 much more frequently than other groups
employed borrowers are the only group with strongly more home-owners among them

plt.figure(figsize = [10,12])

plt.subplot(3,1,1)
sb.countplot(data = df, x = 'IncomeRange', hue = 'CreditRisk', palette = 'RdYlBu_r')

plt.subplot(3,1,2)
sb.countplot(data = df, x = 'IncomeRange', hue = 'Term', palette = 'pastel')

plt.subplot(3,1,3)
sb.countplot(data = df, x = 'IncomeRange', hue = 'IsBorrowerHomeowner', palette = 'colorblind')

plt.tight_layout();

there are high risk borrowers even among the high earners
but the share of low risk borrowers among the top earners is of course higher than in other income groups

Development in time¶

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR')
plt.title('Borrower APR development in time')
plt.ylabel('Borrower APR')
plt.xlabel('Loan origination year')
plt.ylim((0,0.3));

average interest rate is in decline since 2012

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount')
plt.title('Loan original amount development in time')
plt.ylabel('Average loan original amount')
plt.xlabel('Loan origination year');

average loan amoounts are increasing since 2010

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'IncomeVerifiable')
plt.ylim((0,1.2));

share of borrowers with verified income is slightly decreasing in time

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'IsBorrowerHomeowner');

share of borrowers who own a home is quite steady since 2010

sb.countplot(data = df, x = df['LoanOriginationDate'].dt.year, hue = 'IsBorrowerHomeowner', palette = 'colorblind');

the same information can be shown also this way

sb.countplot(data = df, x = df['LoanOriginationDate'].dt.year, hue = 'Term', palette = 'pastel')
plt.ylabel('Number of loans')
plt.legend(title = 'Term', loc = 'upper left');

60m term was introduced first in 2010

plt.figure(figsize = [10,8])
sb.countplot(data = df, x = df['LoanOriginationDate'].dt.year, hue = 'CreditRating', palette = 'RdYlBu_r');

plt.figure(figsize = [10,8])
sb.countplot(data = df, x = df['LoanOriginationDate'].dt.year, hue = 'CreditRisk', palette = 'RdYlBu_r')
plt.xlabel('Loan origination year')
plt.ylabel('Number of loans')
plt.title('Loans distribution by credit risk');

plt.figure(figsize = [10,12])
sb.countplot(data = df, x = df['LoanOriginationDate'].dt.year, hue = 'EmploymentStatus')
plt.ylim((0,6000));

employed category was introduced in 2010, apparently as a successor of full-time category, but the full-time number is decreasing gradually in the following years
not available category was only in 2006 and 2007

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

Borrower APR is negatively correlated with the loan original amount, monthly loan payment and credit score range lower, all of which makes sense. If you can borrow larger amounts, you are likely a more solvent borrower, therefore you get a lower interest rate and you have a higher credit score and can afford and be allowed to pay higher monthly payments.
This interconnectedness of the main features is also shown by the positive correlation between loan original amount with monthly loan payment and with credit score.
The credit rating clearly separates borrowers into different interest rate groups, which is exactly what we would expect from a reasonable lending business. Many of the variables included in the full dataset are inputs for calculating the credit rating, some of which we are still investigating in this analysis. The purpose of all this data about borrowers should lead to an appropriate rating based on which he can get an appropriate interest rate and other loan details (amount, monthly payment, term). High risk borrowers for example get loans for much lower amounts than medium or low risk borrowers.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

Higher income borrowers and home owners borrow higher amounts for lower rates compared with lower income borrowers.
Loans with a higher interest default more often.
Unemployed borrowers pay the highest interest among all employment statuses.
Higher loan amounts are borrowed for longer term.

Multivariate Exploration¶

This section will build on previous findings and combine multiple variables to see how they interact.
A few sections will follow, each focusing on one variable and its relationship with others.

Credit risk¶

# Pair plot colored by credit risk
sb.pairplot(df_samp, 
             vars = ['BorrowerAPR', 'LoanOriginalAmount', 'MonthlyLoanPayment', 'CreditScoreRangeLower'], 
             hue = 'CreditRisk', diag_kind = 'kde', palette = 'RdYlBu_r',
             plot_kws = {'alpha': 0.6, 's': 90},
             size = 4);

Credit risk groups are clearly separated in our featured variables.

g = sb.FacetGrid(data = df_samp, col = 'CreditRisk', hue = 'Term',
                size = 3, aspect = 1.5)
g.map(sb.scatterplot, 'LoanOriginalAmount', 'BorrowerAPR')
g.add_legend()
plt.show();

The rising interest rate among risk groups is visible, as well as the 60m loans having a rather higher interest rate than 36m loans on average in all risk groups.

g = sb.FacetGrid(data = df_samp, col = 'CreditRisk', hue = 'Term',
                size = 3, aspect = 1.5)
g.map(sb.scatterplot, 'LoanOriginalAmount', 'MonthlyLoanPayment')
g.add_legend()
plt.show()

It is clear that for the same loan amount, the monthly payments differ mostly by the term of the loan.
It shows also that we have most borrowers in the medium risk category and that high risk borrowers cannot borrow very high loan amounts.

g = sb.FacetGrid(data = df_samp, col = 'CreditRisk', row = 'IsBorrowerHomeowner', hue = 'Term',
                size = 3, aspect = 1.5)
g.map(sb.scatterplot, 'MonthlyLoanPayment', 'BorrowerAPR', alpha = 0.7)
g.add_legend()
plt.show();

Owning a home does not seem to make a huge difference in interest rate and monthly payments when separating by both credit risk and loan term.
However, we can see that in low risk group, there are more homeowners that non-home owners, whereas the opposite is true for the high risk group.

# it is interesting how the 60m loans appear in the center of distributions
g = sb.FacetGrid(data = df_samp, col = 'CreditRisk', hue = 'Term',
                size = 3, aspect = 1.5)
g.map(sb.scatterplot, 'DebtToIncomeRatio', 'BorrowerAPR', alpha = 0.7)
plt.xlim((0,2))
g.add_legend()
plt.show();

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'CreditRisk', palette = 'RdYlBu_r');

df_rate_dev = df.groupby(['LoanYear', 'CreditRisk'])['BorrowerAPR'].mean()
df_rate_dev = pd.DataFrame(df_rate_dev) # created with multiindex
df_rate_dev.reset_index(inplace=True)

fig = px.bar(df_rate_dev,
             y="CreditRisk",
             x="BorrowerAPR",
             animation_frame="LoanYear",
             animation_group="BorrowerAPR",
             orientation='h',
             color="CreditRisk",
             range_x=[0, df_rate_dev.BorrowerAPR.max()+0.05],
             color_continuous_scale=px.colors.diverging.RdYlBu)
# improve aesthetics (size, grids etc.)
fig.update_layout(width=700,
                  height=500,
                  xaxis_showgrid=False,
                  yaxis_showgrid=False,
                  title_text='Evolution of Borrower APR by credit risk',
                  showlegend=False)

fig.update_xaxes(title_text='Average Borrower APR')
fig.update_yaxes(title_text='Credit Risk')
fig.show()

The interest rate differentiation for the risk group is very clear and quite large.
The average interest rate is declining for high and medium risk groups in the last couple of years, but this is not the case for low risk group.

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount', hue = 'CreditRisk', palette = 'RdYlBu_r');

the riskier the borrower, the lower the amount which he can borrow and this holds true over time

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'IsBorrowerHomeowner', hue = 'CreditRisk', palette = 'RdYlBu_r');

the riskier the borrower, the less likely he is to own a house in general and this holds true over the years

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'IncomeVerifiable', hue = 'CreditRisk', palette = 'RdYlBu_r');

There is an interesting shift where in the beginning of business, the riskiest borrowers apparently had to have a verified income more often than the safer borrower groups, but this is no longer the case since 2009 and the high risk group also has the lowest share of verifiable income.

There are very clear differences among the credit risk groups, which is a good sign. Let's follow with looking if this still holds as clearly in the more detailed credit rating variable.

Credit rating¶

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'CreditRating', palette = 'RdYlBu_r')
plt.legend(loc = 'right', bbox_to_anchor=(1.25, 0.5), ncol=1);

the difference in average interest rate in the worst ratings E and HR seems to be notable only in the last 3 years

plt.figure(figsize=(15,5))
sb.barplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount', hue = 'CreditRating', palette = 'RdYlBu_r')
plt.legend(loc = 'right', bbox_to_anchor=(1.1, 0.5), ncol=1);

loan amounts increase in time across categories, most notably the loan original amounts in C rating are getting closer to those of higher ratings

sb.catplot(x = 'CreditRating', y = 'BorrowerAPR', hue = 'Term', kind = 'bar', data = df_samp, size = 6, palette = 'pastel');

for AA and A ratings, 60m loans have the highest interest rate
for B ratings, 12m and 60m bear a very similar interest rate
for worse than B ratings, 12m loan is actually more expensive than 60m loan
the worst rating can only borrow for 36 months

sb.catplot(x = 'CreditRating', y = 'LoanOriginalAmount', hue = 'Term', kind = 'bar', data = df_samp, size = 6,\
           palette = 'pastel');

for 36m loan, the most common term, it is very clear that the worse the rating, the lower the borrowed amount

Is Borrower a Homeowner¶

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'IsBorrowerHomeowner', \
             palette = 'colorblind');

borrowers who do not own a house pay a higher interest and this holds throughout the years

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount', hue = 'IsBorrowerHomeowner', \
             palette = 'colorblind');

borrowers with a house borrow more money on average and this is true in all years

g = sb.FacetGrid(data = df_samp, hue = 'IsBorrowerHomeowner', size = 6, palette = 'colorblind')
g.map(plt.scatter, 'LoanOriginalAmount', 'BorrowerAPR', alpha = 0.6)
g.add_legend()
plt.show()

home owners more frequently borrow higher amounts

sb.catplot(x="IsBorrowerHomeowner", y="BorrowerAPR", hue="CreditRisk",
            kind="violin", inner=None,
            palette="RdYlBu_r", data=df_samp);

once assigned a risk level, a home ownership does not make too much difference in the average interest rate in that risk group

# the same is also visible here
sb.catplot(x="CreditRisk", y="BorrowerAPR", hue="IsBorrowerHomeowner",
            kind="violin", inner=None, split=True,
            palette="colorblind", data=df_samp);

sb.catplot(x="CreditRisk", y="LoanOriginalAmount", hue="IsBorrowerHomeowner",
            kind="violin", inner=None, split=True,
            palette="colorblind", data=df_samp);

there are more non-home owners borrowing lower amounts (< 10k) in low and medium risk groups than home-owners

Income range¶

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'IncomeRange', palette = 'RdYlBu',\
             linestyles = '--')
plt.legend(loc = 'right', bbox_to_anchor=(1.45, 0.5), ncol=1);

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'LoanOriginalAmount', hue = 'IncomeRange', \
             palette = 'RdYlBu', linestyles = '--')
plt.legend(loc = 'right', bbox_to_anchor=(1.45, 0.5), ncol=1);

sb.catplot(y = 'IncomeRange', x = 'BorrowerAPR', hue = 'CreditRisk', kind = 'box', data = df_samp, size = 6, palette = 'RdYlBu_r');

once assigned into a risk category, income range does not seem to matter much (it was one of the inputs for deciding about the the risk category, so if someone is high risk even with 100k+ income, then this income does not give him lower rates anymore)

sb.catplot(y = 'IncomeRange', x = 'LoanOriginalAmount', hue = 'CreditRisk', kind = 'box', data = df_samp, size = 6, palette = 'RdYlBu_r');

sb.catplot(y = 'IncomeRange', x = 'BorrowerAPR', hue = 'Term', kind = 'box', data = df_samp, size = 6, palette = 'pastel');

sb.catplot(y = 'IncomeRange', x = 'BorrowerAPR', hue = 'IsBorrowerHomeowner', kind = 'box', data = df_samp, size = 6, \
           palette = 'colorblind');

Borrower APR and Loan Original Amount¶

plt.scatter(data = df_samp, x = 'LoanOriginalAmount', y = 'BorrowerAPR', alpha = 1/2, c = 'MonthlyLoanPayment', \
            cmap = 'viridis_r')
plt.colorbar(label = 'Monthly Loan Payment')
plt.ylabel('Borrower APR')
plt.xlabel('Loan Original Amount')
plt.show()

higher loan amounts have lower interest rate and higher monthly payments

plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'BorrowerAPR', alpha = 1/2, c = 'LoanOriginalAmount', \
            cmap = 'viridis_r')
plt.colorbar(label = 'Loan Original Amount')
plt.ylabel('Borrower APR')
plt.xlabel('Monthly Loan Payment')
plt.xlim((20,2000))
plt.xscale('log')
plt.xticks([90, 173, 400, 1e3, 2e3], [90, 173, 400, '1k', '2k'])
plt.show()

plt.scatter(data = df_samp, x = 'MonthlyLoanPayment', y = 'LoanOriginalAmount', alpha = 1/2, c = 'BorrowerAPR', \
            cmap = 'viridis_r')
plt.colorbar(label = 'BorrowerAPR')
plt.ylabel('Loan original amount')
plt.xlabel('Monthly Loan Payment')
plt.xlim((20,2000))
plt.xscale('log')
plt.xticks([90, 173, 400, 1e3, 2e3], [90, 173, 400, '1k', '2k'])
plt.show()

# borrowers with verifiable income borrow higher amounts more often
g = sb.FacetGrid(data = df_samp, col = 'IncomeVerifiable',
                size = 3, aspect = 1.5)
g.map(sb.scatterplot, 'LoanOriginalAmount', 'BorrowerAPR')
plt.show()

Employment status¶

sb.pointplot(data = df, x = df['LoanOriginationDate'].dt.year, y = 'BorrowerAPR', hue = 'EmploymentStatus',\
             markers = ['.','s','*','D','x','o','P','d'], color = 'blue', errwidth=0, linestyles='')
plt.legend(loc = 'right', bbox_to_anchor=(1.45, 0.5), ncol=1);

sb.catplot(y = 'EmploymentStatus', x = 'BorrowerAPR', hue = 'CreditRisk', order = \
           ['Full-time','Employed','Part-time','Self-employed','Retired','Other','Not employed','Not available'], \
           kind = 'box', data = df_samp, size = 6, palette = 'RdYlBu_r');

sb.catplot(y = 'EmploymentStatus', x = 'LoanOriginalAmount', hue = 'CreditRisk', order = \
           ['Full-time','Employed','Part-time','Self-employed','Retired','Other','Not employed','Not available'], \
           kind = 'box', data = df_samp, size = 6, palette = 'RdYlBu_r')
plt.xlim((0,30000));

sb.catplot(y = 'EmploymentStatus', x = 'BorrowerAPR', hue = 'IsBorrowerHomeowner', order = \
           ['Full-time','Employed','Part-time','Self-employed','Retired','Other','Not employed','Not available'], \
           kind = 'box', data = df_samp, size = 6, palette = 'colorblind');

Differences among employment statuses are not very significant or consistent. This combined with knowing that employment categories are very imbalanced as was described earlier prevents from making any solid conclusions.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

Credit risk category, as expected and hoped, is a major factor which segments the borrowers into clearly distinguishable groups.
Being assigned into one of the risk categories strongly determines the interest rate and the borrowed amount.
Borrowed amounts are increasing overall in the last few years, while the average interest rate is decreasing (especially for the medium and high risk groups).

Were there any interesting or surprising interactions between features?¶

House ownership or income range are also great for borrower segmentation, but only until combined with credit risk. This is because house ownership, income range and many other available variables feed into the credit score/credit rating/credit risk calculations and once a borrower is assigned his credit rating/credit risk, these variables are already accounted for and make no difference anymore.

	Term	LoanOriginalAmount	MonthlyLoanPayment	ListingCategory	BorrowerAPR	BorrowerRate	CreditScoreRangeLower	CreditScoreRangeUpper	StatedMonthlyIncome	DebtToIncomeRatio
count	113066.000000	113066.000000	113066.000000	113066.000000	113041.000000	113066.000000	112475.000000	112475.000000	1.130660e+05	104594.000000
mean	40.800170	8314.762307	271.932742	2.776838	0.218980	0.192946	685.524961	704.524961	5.605120e+03	0.276032
std	10.421518	6237.007841	192.549979	3.998188	0.080483	0.074917	66.635895	66.635895	7.495596e+03	0.553738
min	12.000000	1000.000000	0.000000	0.000000	0.006530	0.000000	0.000000	19.000000	0.000000e+00	0.000000
25%	36.000000	4000.000000	130.950000	1.000000	0.156290	0.134000	660.000000	679.000000	3.199396e+03	0.140000
50%	36.000000	6300.000000	217.370000	1.000000	0.209840	0.184000	680.000000	699.000000	4.666667e+03	0.220000
75%	36.000000	12000.000000	370.570000	3.000000	0.283860	0.250600	720.000000	739.000000	6.824688e+03	0.320000
max	60.000000	35000.000000	2251.510000	20.000000	0.512290	0.497500	880.000000	899.000000	1.750003e+06	10.010000

	BorrowerAPR	BorrowerRate
count	113039.000000	113064.000000
mean	0.218982	0.192947
std	0.080483	0.074916
min	0.006530	0.000000
25%	0.156290	0.134000
50%	0.209840	0.184000
75%	0.283860	0.250600
max	0.512290	0.497500

	CreditScoreRangeLower	CreditScoreRangeUpper
count	112473.000000	112473.000000
mean	685.524882	704.524882
std	66.636364	66.636364
min	0.000000	19.000000
25%	660.000000	679.000000
50%	680.000000	699.000000
75%	720.000000	739.000000
max	880.000000	899.000000

	ListingKey	ListingNumber	ListingCreationDate	CreditGrade	Term	LoanStatus	ClosedDate	BorrowerAPR	BorrowerRate	LenderYield	EstimatedEffectiveYield	EstimatedLoss	EstimatedReturn	ProsperRating (numeric)	ProsperRating (Alpha)	ProsperScore	ListingCategory (numeric)	BorrowerState	Occupation	EmploymentStatus	EmploymentStatusDuration	IsBorrowerHomeowner	CurrentlyInGroup	GroupKey	DateCreditPulled	CreditScoreRangeLower	CreditScoreRangeUpper	FirstRecordedCreditLine	CurrentCreditLines	OpenCreditLines	TotalCreditLinespast7years	OpenRevolvingAccounts	OpenRevolvingMonthlyPayment	InquiriesLast6Months	TotalInquiries	CurrentDelinquencies	AmountDelinquent	DelinquenciesLast7Years	PublicRecordsLast10Years	PublicRecordsLast12Months	RevolvingCreditBalance	BankcardUtilization	AvailableBankcardCredit	TotalTrades	TradesNeverDelinquent (percentage)	TradesOpenedLast6Months	DebtToIncomeRatio	IncomeRange	IncomeVerifiable	StatedMonthlyIncome	LoanKey	TotalProsperLoans	TotalProsperPaymentsBilled	OnTimeProsperPayments	ProsperPaymentsLessThanOneMonthLate	ProsperPaymentsOneMonthPlusLate	ProsperPrincipalBorrowed	ProsperPrincipalOutstanding	ScorexChangeAtTimeOfListing	LoanFirstDefaultedCycleNumber	LoanMonthsSinceOrigination	LoanNumber	LoanOriginalAmount	LoanOriginationDate	LoanOriginationQuarter	MemberKey	MonthlyLoanPayment	LP_CustomerPayments	LP_CustomerPrincipalPayments	LP_InterestandFees	LP_ServiceFees	PercentFunded	Investors
0	1021339766868145413AB3B	193129	2007-08-26 19:09:29.263000000	C	36	Completed	2009-08-14 00:00:00	0.16516	0.1580	0.1380	NaN	NaN	NaN	NaN	NaN	NaN	0	CO	Other	Self-employed	2.0	True	True	NaN	2007-08-26 18:41:46.780000000	640.0	659.0	2001-10-11 00:00:00	5.0	4.0	12.0	1	24.0	3.0	3.0	2.0	472.0	4.0	0.0	0.0	0.0	0.00	1500.0	11.0	0.81	0.0	0.17	$25,000-49,999	True	3083.333333	E33A3400205839220442E84	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	78	19141	9425	2007-09-12 00:00:00	Q3 2007	1F3E3376408759268057EDA	330.43	11396.14	9425.00	1971.14	-133.18	1.0	258
1	10273602499503308B223C1	1209647	2014-02-27 08:28:07.900000000	NaN	36	Current	NaN	0.12016	0.0920	0.0820	0.07960	0.0249	0.05470	6.0	A	7.0	2	CO	Professional	Employed	44.0	False	False	NaN	2014-02-27 08:28:14	680.0	699.0	1996-03-18 00:00:00	14.0	14.0	29.0	13	389.0	3.0	5.0	0.0	0.0	0.0	1.0	0.0	3989.0	0.21	10266.0	29.0	1.00	2.0	0.18	$50,000-74,999	True	6125.000000	9E3B37071505919926B1D82	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	134815	10000	2014-03-03 00:00:00	Q1 2014	1D13370546739025387B2F4	318.93	0.00	0.00	0.00	0.00	1.0	1
2	0EE9337825851032864889A	81716	2007-01-05 15:00:47.090000000	HR	36	Completed	2009-12-17 00:00:00	0.28269	0.2750	0.2400	NaN	NaN	NaN	NaN	NaN	NaN	0	GA	Other	Not available	NaN	False	True	783C3371218786870A73D20	2007-01-02 14:09:10.060000000	480.0	499.0	2002-07-27 00:00:00	NaN	NaN	3.0	0	0.0	0.0	1.0	1.0	NaN	0.0	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.06	Not displayed	True	2083.333333	6954337960046817851BCB2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	86	6466	3001	2007-01-17 00:00:00	Q1 2007	5F7033715035555618FA612	123.32	4186.63	3001.00	1185.63	-24.20	1.0	41
3	0EF5356002482715299901A	658116	2012-10-22 11:02:35.010000000	NaN	36	Current	NaN	0.12528	0.0974	0.0874	0.08490	0.0249	0.06000	6.0	A	9.0	16	GA	Skilled Labor	Employed	113.0	True	False	NaN	2012-10-22 11:02:32	800.0	819.0	1983-02-28 00:00:00	5.0	5.0	29.0	7	115.0	0.0	1.0	4.0	10056.0	14.0	0.0	0.0	1444.0	0.04	30754.0	26.0	0.76	0.0	0.15	$25,000-49,999	True	2875.000000	A0393664465886295619C51	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	16	77296	10000	2012-11-01 00:00:00	Q4 2012	9ADE356069835475068C6D2	321.45	5143.20	4091.09	1052.11	-108.01	1.0	158
4	0F023589499656230C5E3E2	909464	2013-09-14 18:38:39.097000000	NaN	36	Current	NaN	0.24614	0.2085	0.1985	0.18316	0.0925	0.09066	3.0	D	4.0	2	MN	Executive	Employed	44.0	True	False	NaN	2013-09-14 18:38:44	680.0	699.0	2004-02-20 00:00:00	19.0	19.0	49.0	6	220.0	1.0	9.0	0.0	0.0	0.0	0.0	0.0	6193.0	0.81	695.0	39.0	0.95	2.0	0.26	$100,000+	True	9583.333333	A180369302188889200689E	1.0	11.0	11.0	0.0	0.0	11000.0	9947.9	NaN	NaN	6	102670	15000	2013-09-20 00:00:00	Q3 2013	36CE356043264555721F06C	563.97	2819.85	1563.22	1256.63	-60.27	1.0	20

	ListingKey	ListingNumber	ListingCreationDate	CreditGrade	Term	LoanStatus	ClosedDate	BorrowerAPR	BorrowerRate	LenderYield	EstimatedEffectiveYield	EstimatedLoss	EstimatedReturn	ProsperRating (numeric)	ProsperRating (Alpha)	ProsperScore	ListingCategory (numeric)	BorrowerState	Occupation	EmploymentStatus	EmploymentStatusDuration	IsBorrowerHomeowner	CurrentlyInGroup	GroupKey	DateCreditPulled	CreditScoreRangeLower	CreditScoreRangeUpper	FirstRecordedCreditLine	CurrentCreditLines	OpenCreditLines	TotalCreditLinespast7years	OpenRevolvingAccounts	OpenRevolvingMonthlyPayment	InquiriesLast6Months	TotalInquiries	CurrentDelinquencies	AmountDelinquent	DelinquenciesLast7Years	PublicRecordsLast10Years	PublicRecordsLast12Months	RevolvingCreditBalance	BankcardUtilization	AvailableBankcardCredit	TotalTrades	TradesNeverDelinquent (percentage)	TradesOpenedLast6Months	DebtToIncomeRatio	IncomeRange	IncomeVerifiable	StatedMonthlyIncome	LoanKey	TotalProsperLoans	TotalProsperPaymentsBilled	OnTimeProsperPayments	ProsperPaymentsLessThanOneMonthLate	ProsperPaymentsOneMonthPlusLate	ProsperPrincipalBorrowed	ProsperPrincipalOutstanding	ScorexChangeAtTimeOfListing	LoanCurrentDaysDelinquent	LoanFirstDefaultedCycleNumber	LoanMonthsSinceOrigination	LoanNumber	LoanOriginalAmount	LoanOriginationDate	LoanOriginationQuarter	MemberKey	MonthlyLoanPayment	LP_CustomerPayments	LP_CustomerPrincipalPayments	LP_InterestandFees	LP_ServiceFees	LP_CollectionFees	LP_GrossPrincipalLoss	LP_NetPrincipalLoss	LP_NonPrincipalRecoverypayments	PercentFunded	Recommendations	InvestmentFromFriendsCount	InvestmentFromFriendsAmount	Investors
8	0F043596202561788EA13D5	1023355	2013-12-02 10:43:39.117000000	NaN	36	Current	NaN	0.0762	0.0629	0.0529	0.05221	0.0099	0.04231	7.0	AA	9.0	7	IL	Food Service	Employed	269.0	True	False	NaN	2013-12-02 10:43:39	820.0	839.0	1993-04-01 00:00:00	17.0	16.0	32.0	12	219.0	1.0	6.0	0.0	0.0	0.0	0.0	0.0	9906.0	0.11	77696.0	29.0	1.0	1.0	0.25	$25,000-49,999	True	3333.333333	02163700809231365A56A1C	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	NaN	3	121268	10000	2013-12-12 00:00:00	Q4 2013	49A53699682291323D04D66	305.54	611.08	505.58	105.5	-16.77	0.0	0.0	0.0	0.0	1.0	0	0	0.0	1
9	0F043596202561788EA13D5	1023355	2013-12-02 10:43:39.117000000	NaN	36	Current	NaN	0.0762	0.0629	0.0529	0.05221	0.0099	0.04231	7.0	AA	11.0	7	IL	Food Service	Employed	269.0	True	False	NaN	2013-12-02 10:43:39	820.0	839.0	1993-04-01 00:00:00	17.0	16.0	32.0	12	219.0	1.0	6.0	0.0	0.0	0.0	0.0	0.0	9906.0	0.11	77696.0	29.0	1.0	1.0	0.25	$25,000-49,999	True	3333.333333	02163700809231365A56A1C	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	NaN	3	121268	10000	2013-12-12 00:00:00	Q4 2013	49A53699682291323D04D66	305.54	611.08	505.58	105.5	-16.77	0.0	0.0	0.0	0.0	1.0	0	0	0.0	1

	ListingKey	ListingNumber	ListingCreationDate	CreditGrade	Term	LoanStatus	ClosedDate	BorrowerAPR	BorrowerRate	LenderYield	EstimatedEffectiveYield	EstimatedLoss	EstimatedReturn	ProsperRating (numeric)	ProsperRating (Alpha)	ProsperScore	ListingCategory (numeric)	BorrowerState	Occupation	EmploymentStatus	EmploymentStatusDuration	IsBorrowerHomeowner	CurrentlyInGroup	GroupKey	DateCreditPulled	CreditScoreRangeLower	CreditScoreRangeUpper	FirstRecordedCreditLine	CurrentCreditLines	OpenCreditLines	TotalCreditLinespast7years	OpenRevolvingAccounts	OpenRevolvingMonthlyPayment	InquiriesLast6Months	TotalInquiries	RevolvingCreditBalance	BankcardUtilization	AvailableBankcardCredit	TotalTrades	TradesNeverDelinquent (percentage)	TradesOpenedLast6Months	DebtToIncomeRatio	IncomeRange	IncomeVerifiable	StatedMonthlyIncome	LoanKey	TotalProsperLoans	TotalProsperPaymentsBilled	OnTimeProsperPayments	ProsperPaymentsLessThanOneMonthLate	ProsperPaymentsOneMonthPlusLate	ProsperPrincipalBorrowed	ProsperPrincipalOutstanding	ScorexChangeAtTimeOfListing	LoanFirstDefaultedCycleNumber	LoanMonthsSinceOrigination	LoanNumber	LoanOriginalAmount	LoanOriginationDate	LoanOriginationQuarter	MemberKey	MonthlyLoanPayment	PercentFunded	Investors
998	0A0635972629771021E38F3	1055932	2013-12-18 14:25:17.053000000	NaN	60	Current	NaN	0.1771	0.1535	0.1435	0.13507	0.0599	0.07517	5.0	B	6.0	1	MD	Other	Employed	100.0	True	False	NaN	2014-02-08 11:39:13	760.0	779.0	1994-11-01 00:00:00	16.0	16.0	34.0	15	623.0	1.0	5.0	8028.0	0.27	14190.0	32.0	0.96	1.0	0.35	$25,000-49,999	True	3750.0	CB643706582969191E8C26B	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	132043	10000	2014-02-12 00:00:00	Q1 2014	98A937006186534818378B9	239.74	1.0	1
999	0A0635972629771021E38F3	1055932	2013-12-18 14:25:17.053000000	NaN	60	Current	NaN	0.1771	0.1535	0.1435	0.13507	0.0599	0.07517	5.0	B	7.0	1	MD	Other	Employed	100.0	True	False	NaN	2014-02-08 11:39:13	760.0	779.0	1994-11-01 00:00:00	16.0	16.0	34.0	15	623.0	1.0	5.0	8028.0	0.27	14190.0	32.0	0.96	1.0	0.35	$25,000-49,999	True	3750.0	CB643706582969191E8C26B	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	132043	10000	2014-02-12 00:00:00	Q1 2014	98A937006186534818378B9	239.74	1.0	1
57494	0A0635972629771021E38F3	1055932	2013-12-18 14:25:17.053000000	NaN	60	Current	NaN	0.1771	0.1535	0.1435	0.13507	0.0599	0.07517	5.0	B	8.0	1	MD	Other	Employed	100.0	True	False	NaN	2014-02-08 11:39:13	760.0	779.0	1994-11-01 00:00:00	16.0	16.0	34.0	15	623.0	1.0	5.0	8028.0	0.27	14190.0	32.0	0.96	1.0	0.35	$25,000-49,999	True	3750.0	CB643706582969191E8C26B	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1	132043	10000	2014-02-12 00:00:00	Q1 2014	98A937006186534818378B9	239.74	1.0	1

	LoanOriginationDate	ClosedDate	LoanStatus	Term	LoanOriginalAmount	MonthlyLoanPayment	ListingCategory	BorrowerAPR	BorrowerRate	CreditGrade	ProsperRating	CreditScoreRangeLower	CreditScoreRangeUpper	IncomeRange	IncomeVerifiable	StatedMonthlyIncome	DebtToIncomeRatio	EmploymentStatus	IsBorrowerHomeowner
LoanKey
E33A3400205839220442E84	2007-09-12	2009-08-14	Completed	36	9425	330.43	0	0.16516	0.158	C	NaN	640.0	659.0	$25,000-49,999	True	3083.333333	0.17	Self-employed	True
9E3B37071505919926B1D82	2014-03-03	NaT	Current	36	10000	318.93	2	0.12016	0.092	NaN	A	680.0	699.0	$50,000-74,999	True	6125.000000	0.18	Employed	False
6954337960046817851BCB2	2007-01-17	2009-12-17	Completed	36	3001	123.32	0	0.28269	0.275	HR	NaN	480.0	499.0	Not displayed	True	2083.333333	0.06	Not available	False

	LoanOriginationDate	ClosedDate	LoanStatus	Term	LoanOriginalAmount	MonthlyLoanPayment	ListingCategory	BorrowerAPR	BorrowerRate	CreditGrade	ProsperRating	CreditScoreRangeLower	CreditScoreRangeUpper	IncomeRange	IncomeVerifiable	StatedMonthlyIncome	DebtToIncomeRatio	EmploymentStatus	IsBorrowerHomeowner	LoanYear	MonthsToClosed
LoanKey
5BA53702830459166F4AD14	2013-12-31	2010-03-16	Defaulted	36	2500	82.31	1	0.14206	0.1139	NaN	A	660.0	679.0	$75,000-99,999	False	6250.000000	NaN	Self-employed	False	2013	-45.536869
BF6034334495990283E5285	2008-10-17	2008-10-16	Cancelled	36	1000	32.62	1	0.16219	0.1075	A	NaN	720.0	739.0	$25,000-49,999	True	2444.666667	0.34	Full-time	False	2008	-0.032855

	CreditGrade	ProsperRating	CreditRating
LoanKey
C18E361608562401325491D	NaN	B	B
898E36905220589760F1483	NaN	E	E
D49F370068074845932209D	NaN	E	E
6E50357259015325913E4BB	NaN	AA	AA
D86436994171560029D19DC	NaN	C	C
28C9340400229221010B206	E	NaN	E
893136977367241052D685B	NaN	AA	AA
AD153375061377785C48584	C	NaN	C
A74F3700100112559D4AF3D	NaN	AA	AA
8AD033737979402480586AB	C	NaN	C
34C433873014471597AB3E7	E	NaN	E
5DAA3698161759183A266D5	NaN	C	C
148F3430591185111307B85	B	NaN	B
D4AA37029930158805FDEC5	NaN	B	B
FD7B33644729348240DD2EE	D	NaN	D

	LoanOriginationDate	CreditGrade	ProsperRating	CreditRating
LoanKey
965C3574741528517B48FD1	2009-12-16	NaN	NaN	NaN
0BDA35626345884925A83EC	2009-07-28	NaN	NaN	NaN
9B3536037649888496B4961	2010-11-05	NaN	NaN	NaN
6D2535833186252175F08F8	2010-03-26	NaN	NaN	NaN
5B1E3565441542551870016	2009-08-20	NaN	NaN	NaN

	LoanOriginationDate	CreditGrade	ProsperRating
count	131	0	0
unique	99	0	0
top	2009-11-30 00:00:00	NaN	NaN
freq	8	NaN	NaN
first	2009-07-22 00:00:00	NaN	NaN
last	2010-12-30 00:00:00	NaN	NaN

	CreditRating	CreditRisk
LoanKey
D39434325241157906FA1B2	B	medium
07123584186072707FB3CDB	A	low
46CB34293810184059B03BE	A	low
B2D436152228842109ACABA	B	medium
CCE337036924778018C5147	A	low

	CreditRating	CreditRisk
LoanKey
3AAF3565209388267076BAE	NaN	None
6D2535833186252175F08F8	NaN	None
0E3D35804630391047D2500	NaN	None
0F133562741791476A19C8F	NaN	None
7E363573082201850D66D42	NaN	None

	IncomeVerifiable	CreditRisk	LoanStatus	EmploymentStatus	IncomeRange
LoanKey
04DF3672370940669E7B16B	False	medium	Completed	Self-employed	$100,000+
3D5836438779123523A2DA1	False	high	Current	Self-employed	$100,000+
ABFC358335339485700883A	True	medium	Completed	Full-time	$100,000+
C7963686536898613C60349	True	medium	Completed	Employed	$100,000+
2DE533983085546702B368E	True	medium	Chargedoff	Full-time	$100,000+