This report investigates the Kaggle no-show appointments dataset, which collects information from over 100k medical appointments in Brazil. There are a number of patient characteristics included and the aim of the report is to compare and contrast the no-show cases with those who showed up for their appointments.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
# %config InlineBackend.figure_format = 'retina' # high-resolution plots
# read the dataset
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()
# get dataset dimensions
df.shape
# show summary for all columns
df.describe(include = 'all')
df.hist(figsize=(10,10));
# check for duplicates - there are none
print(df.duplicated().sum())
# how many different patients exist in the dataset
df.PatientId.nunique()
# check that AppointmentID is a unique row identifier
df.AppointmentID.nunique()
AppointmentID will not be interesting in our analysis, so let's drop it in the following Data Cleaning section.
Let's rename mistyped column names such as Hipertension and Handcap.
Let's also rename SMS_received and No-show to have a consistent column naming.
Let's also further investigate the suspicious minimum and maximum values of Age and values of Handcap:
df['Age'].value_counts()
df[df['Age']==-1]
# age of -1 is obviously an error, let's change it to 0 in the next section
df.groupby(['Age', 'PatientId'])['Age'].count()
# there are actually just two patients aged 115, which might be plausible
df['Handcap'].value_counts()
# there are a few cases of more than one handicap, so handicap is a categorical variable,
# where 4 means that a patient has 4 different handicaps
# check for missing values and data types
df.info()
# confirmation that there are no null values
df.isnull().sum().any()
There are no missing values, but there are issues with data types:
df['Neighbourhood'].value_counts()
# there are 81 places, some of them with large number of appointments, it makes sense to categorize it
Let's perform the cleaning steps which were outlined above.
# dropping AppointmentID column
df.drop(['AppointmentID'], axis=1, inplace=True)
# renaming mistyped columns
df.rename(columns={'Hipertension': 'Hypertension', 'Handcap': 'Handicap', 'SMS_received': 'SMSReceived', 'No-show': 'NoShow'},\
inplace=True)
# confirm changes
df.head(5)
# check for duplicates again after dropping ID column
print(df.duplicated().sum())
# show some of the duplicates
df_dup = df[df.duplicated()]
df_dup.head(5)
After deleting AppointmentID, which was an automated counter, there are now 618 duplicate values.
It could be an error, where in reality there should be only one entry per duplicate, but it could also be valid: patients can indeed be scheduled for more than one appointment at the same time.
I decided to keep duplicate values in the data set and treat them as separate appointments.
# change Age=-1 to a more realistic Age=0
df['Age'].replace({-1: 0}, inplace=True)
# confirm the change
df['Age'].value_counts()
# change Gender, Neighbourhood, Handicap to categorical data type
for c in ['Gender', 'Neighbourhood', 'Handicap']:
df[c] = df[c].astype('category')
# day columns data type change
for c in ['ScheduledDay', 'AppointmentDay']:
df[c] = pd.to_datetime(df[c])
# change Scholarship, Hypertension, Diabetes, Alcoholism, SMSReceived to bool data type
for c in ['Scholarship', 'Hypertension', 'Diabetes', 'Alcoholism', 'SMSReceived']:
df[c] = df[c].astype('bool')
# NoShow column transformation
df['NoShow'].replace({'No': 0, 'Yes': 1}, inplace=True)
# NoShow type change
df['NoShow'] = df['NoShow'].astype('bool')
# confirm changed data types
df.dtypes
# check columns summary
df.describe(include = 'all')
# check a few rows of the cleaned data
df.head()
# countplot function usable in subplots
def count_plot(dfdata, xdata, gxlabel, gylabel, gtitle, gcolor=None, gpalette=None):
g = sns.countplot(data = dfdata, x = xdata, color = gcolor, palette= gpalette)
plt.title(gtitle)
plt.xlabel(gxlabel)
plt.ylabel(gylabel)
return g
# let's first look at details of individual patient characteristics and of our dependent variable: NoShow
# share of NoShows on all appointments
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
count_plot(df, 'NoShow', 'No Show', 'Count of appointments', 'Appointments count based on showing up')
plt.subplot(1,2,2)
df.NoShow.value_counts().plot(kind='pie',autopct='%1.0f%%')
plt.title('Share of no show appointments')
plt.show()
# calculate total counts of no shows in data
noshow_totals = df.groupby('NoShow').count()['Age']
noshow_totals
# there are 22319 no shows in the data, which makes 20% of all appointments
noshow_proportion = df['NoShow'].sum() / len(df.index)
noshow_proportion
# the highest correlation is for hypertension and age, hypertension and diabetes, and diabetes and age
df.corr().style.background_gradient(cmap='Blues')
# age hypertension scatter with line fit
# there are more people with hypertension among older patients, as correlation of 0.5 from above suggests
hypertension_mean = df['Hypertension'].groupby(df['Age']).mean()
ages = df['Age'].unique()
sns.regplot(x = ages, y = hypertension_mean)
plt.xlabel('Age')
plt.show()
# age diabetes scatter with line fit
# there are more people with diabetes among older patients
diabetes_mean = df['Diabetes'].groupby(df['Age']).mean()
ages = df['Age'].unique()
sns.regplot(x = ages, y = diabetes_mean)
plt.xlabel('Age')
plt.show()
# the outlier is caused by having only one appointment with a patient aged 99 years and this patient is diabetic
df.query('Age >= 99').head(15)
# repeating the above, but cutting out the one outlier appointment to see the correlation better
diabetes_mean = df['Diabetes'].groupby(df['Age']).mean()
ages = df['Age'].unique()
sns.regplot(x = ages, y = diabetes_mean)
plt.xlabel('Age')
plt.ylim(0,0.4)
plt.show()
# age distribution
# young people have many appointments, which could be because of a number of mandatory medical visits at that age
g = df.Age.hist(bins=20);
g.set(xlabel = 'Age', ylabel = 'Count of appointments', title = 'Age distribution')
plt.show()
# barplot function
def bar_plot(xdata, ydata, gxlabel, gylabel, gtitle, gerrwidth=None, gpalette=None):
g = sns.barplot(x = xdata, y = ydata, errwidth = gerrwidth, palette = gpalette)
g.set(xlabel = gxlabel, ylabel = gylabel, title = gtitle)
return g
# proportion of appointments of patients on scholarship by age
# it might be the case that scholarship is aimed predominantly at children and parents of children based on the plot
plt.figure(figsize=(30,8))
bar_plot(df['Age'], df['Scholarship'], 'Age', 'Scholarship rate', 'Scholarship rate by age', 0)
plt.show()
# how alcoholism is distributed among patients scheduling appointments by age
# the most endangered group are patients between their 40s and 60s
plt.figure(figsize=(30,8))
bar_plot(df['Age'], df['Alcoholism'], 'Age', 'Alcoholism rate', 'Alcoholism rate by age', 0)
plt.show()
# there seems to be a decreasing no show rate as age increases between ages 15-80
plt.figure(figsize=(30,8))
plt.ylim(0,0.7)
bar_plot(df['Age'], df['NoShow'], 'Age', 'No show rate', 'No show rate by age', 0)
plt.show()
# the above is also confirmed by the following plot
# there are fewer data points for ages higher than 90 (check histogram above), that is why the upper end is so volatile
y_means = df['NoShow'].groupby(df['Age']).mean()
x_age = df['Age'].unique()
x_age = sorted(x_age)
plt.errorbar(x = x_age, y = y_means)
plt.title('No show rate by age')
plt.xlabel('Age')
plt.ylabel('No show rate')
plt.show()
# most appointments are made by patients with no handicap
count_plot(df, 'Handicap', 'Handicap', 'Count of appointments', 'Handicap distribution', sns.color_palette()[0])
plt.show()
# age distribution for different handicap groups
# age distribution of patients with 4 handicaps is very different because we only have a very small sample of these patients
g = sns.boxplot(data = df, x = 'Age', y = 'Handicap')
g.set(xlabel='Age', ylabel='Number of handicaps', title='Age distribution by number of handicaps')
plt.show()
# number of handicaps does not make much of a difference in showing up
# (we can disregard 3 and 4 handicaps because there are just a few data points, as is suggested by the large error bars and distribution chart above)
g = sns.barplot(x = df['NoShow'], y = df['Handicap'])
g.set(xlabel='No show rate', ylabel='Number of handicaps', title='Appointments no show rate by number of handicaps')
plt.show()
# share of neighbourhoods
# there are a couple of very large neighbourhoods by a number of appointments
plt.figure(figsize=(35,30))
plt.subplot(2, 1, 1)
plt.xticks(rotation=90)
g = sns.countplot(data = df, x = 'Neighbourhood', order = df['Neighbourhood'].value_counts().index, \
color=sns.color_palette()[0])
g.set(ylabel = 'Count of appointments', title = 'Appointments count by neighbourhood')
# neighbourhoods do not deviate widely from the overall 20% no show rate
plt.subplot(2, 1, 2)
plt.xticks(rotation=90)
g = sns.barplot(x = df['Neighbourhood'], y = df['NoShow'], order = df['Neighbourhood'].value_counts().index, \
color=sns.color_palette()[0]);
g.set_title('Appointments no show rate by neighbourhood')
g.set(ylabel='No show rate', ylim=(0,0.35), xlabel='')
g.axhline(noshow_proportion)
plt.show()
# there are more females scheduling appointments
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
count_plot(df, 'Gender', 'Gender', 'Count of appointments', 'Appointments count by gender')
plt.subplot(1,2,2)
g2 = df.Gender.value_counts().plot(kind='pie',autopct='%1.0f%%')
g2.set(title = 'Share of appointments by gender')
plt.show()
Females seem to be either in more need of medical appointments or simply taking better care of their health.
Let's check below whether this also means that they miss fewer of their appointments than males do.
# number of appointments by gender
gender_totals = df['Gender'].value_counts()
gender_totals
# calculate noshow counts by gender
noshow_by_gender = df.groupby(['Gender', 'NoShow']).count()['Age']
noshow_by_gender
# plotting these counts does not help for distinguishing gender difference in no shows much because
# there are many more female patients in our data
g = sns.countplot(data = df, x = 'Gender', hue = 'NoShow')
g.set(ylabel = 'Count of appointments', title = 'No show distribution by gender')
plt.show()
# calculating frequencies of noshows for females
noshow_proportions_female = noshow_by_gender['F'] / gender_totals['F']
noshow_proportions_female
# calculating frequencies of noshows for males
noshow_proportions_male = noshow_by_gender['M'] / gender_totals['M']
noshow_proportions_male
# plot mean of no-show
bar_plot(df['Gender'], df['NoShow'], 'Gender', 'No show rate', 'Appointments no show rate by gender', 0)
plt.show()
There is no large difference in noshow proportion among male and female patients.
Let's now see the distribution of no shows for boolean characteristics:
# subplots with amounts of appointments per characteristic value
plt.figure(figsize = [20, 9])
plt.subplot(2, 3, 1)
g1 = count_plot(df, 'Alcoholism', 'Alcoholism', 'Count of appointments', '')
plt.subplot(2, 3, 2)
count_plot(df, 'Diabetes', 'Diabetes', 'Count of appointments', '')
plt.ylim(g1.get_ylim())
plt.subplot(2, 3, 3)
count_plot(df, 'Hypertension', 'Hypertension', 'Count of appointments', '')
plt.ylim(g1.get_ylim())
plt.subplot(2, 3, 4)
count_plot(df, 'Scholarship', 'Scholarship', 'Count of appointments', '')
plt.ylim(g1.get_ylim())
plt.subplot(2, 3, 5)
count_plot(df, 'SMSReceived', 'SMSReceived', 'Count of appointments', '')
plt.ylim(g1.get_ylim())
plt.subplot(2, 3, 6)
count_plot(df, 'Gender', 'Gender', 'Count of appointments', '')
plt.ylim(g1.get_ylim())
plt.show()
Among available boolean characteristics, we can hypothesize the following:
Let's plot no show rates to see what patterns there are in the data.
# plot details function to be used in the following bar charts
def plot_details(plot):
plot.set(ylim=(0,0.3))
plot.axhline(noshow_proportion, color='black')
plt.ylim(g1.get_ylim())
# subplots with no show rates per characteristic
# the overall no show rate is shown by a horizontal line
plt.figure(figsize = [20, 9])
plt.subplot(2, 3, 1)
g1 = bar_plot(df['Alcoholism'], df['NoShow'], 'Alcoholism', 'No show rate', '', 0)
plot_details(g1)
plt.subplot(2, 3, 2)
g2 = bar_plot(df['Diabetes'], df['NoShow'], 'Diabetes', 'No show rate', '', 0)
plot_details(g2)
plt.subplot(2, 3, 3)
g3 = bar_plot(df['Hypertension'], df['NoShow'], 'Hypertension', 'No show rate', '', 0)
plot_details(g3)
plt.subplot(2, 3, 4)
g4 = bar_plot(df['Scholarship'], df['NoShow'], 'Scholarship', 'No show rate', '', 0)
plot_details(g4)
plt.subplot(2, 3, 5)
g5 = bar_plot(df['SMSReceived'], df['NoShow'], 'SMS Received', 'No show rate', '', 0)
plot_details(g5)
plt.subplot(2, 3, 6)
g6 = bar_plot(df['Gender'], df['NoShow'], 'Gender', 'No show rate', '', 0)
plot_details(g6)
plt.show()
The plots reveal the following:
# low no show rate is expected for the below combination of characteristics
combined_issues = df.query('Diabetes == True and Hypertension == True and Handicap > 0')
len(combined_issues)
noshows_combined_issues = df.query('NoShow == True and Diabetes == True and Hypertension == True and Handicap > 0')
len(noshows_combined_issues)
# the expectation is confirmed
noshow_share_combined_issues = len(noshows_combined_issues) / len(combined_issues)
noshow_share_combined_issues
# high no show rate is expected for the below combination of characteristics
combined_issues2 = df.query('Scholarship == True and SMSReceived == True')
len(combined_issues2)
noshows_combined_issues2 = df.query('NoShow == True and Scholarship == True and SMSReceived == True')
len(noshows_combined_issues2)
# the expectation is confirmed
noshow_share_combined_issues2 = len(noshows_combined_issues2) / len(combined_issues2)
noshow_share_combined_issues2
# count number of days between appointment scheduled day and appointment day
datediff = df['AppointmentDay'] - df['ScheduledDay']
datediff = pd.to_timedelta(datediff,'D')
datediff.describe()
# extract just days from the calculated difference
datediff = round(datediff / np.timedelta64(1, 'D')).astype('int')
# distribution of date differences - there are many appointments with schedule and appointment at the same day,
# which means that patients are probably not scheduling in advance
# this would suggest that no shows for day difference equal to 0 should be close to 0%, let's look into it below
datediff.hist(bins=100)
plt.xlabel('Number of days')
plt.ylabel('Count of appointments')
plt.title('Distribution of number of days between schedule date and appointment date')
plt.show()
# merge datediff to df
df['DateDiff'] = datediff
# there are 12161 appointments with negative day difference
len(df.query('DateDiff < 0'))
# almost all of these have difference of -1
len(df.query('DateDiff == -1'))
# this is the case because when the appointment happens on the same date as it is scheduled, the scheduled day includes time,
# but the appointment date does NOT
# these should therefore be changed to diff = 0
df.query('DateDiff < 0').head()
# let's change these remaining 3 negative cases to 0, too
df.query('DateDiff < -1').head()
# change negative differences to 0 because they must be errors
df['DateDiff'] = df['DateDiff'].clip(lower=0)
# there are 40972 appointments scheduled and happening on the same day
len(df.query('DateDiff == 0'))
# check DateDiff summary
df['DateDiff'].describe()
plt.figure(figsize=(30,10))
bar_plot(df['DateDiff'], df['NoShow'], 'Number of days between schedule day and appointment day', 'No show rate', \
'Appointments no show rate by number of days between schedule day and the appointment day')
plt.show()
It is clear from the above chart that appointments happening on the scheduled day (most of which are probably in reality not being scheduled at all) largely decrease the no show rate to the overall 20%.
Most of the other date difference values have no show rates higher than 20%.
# there are generally fewer data points for date diff higher than 50 (we know that from the histogram higher up),
# that is why the upper end is so volatile
y_means = df['NoShow'].groupby(df['DateDiff']).mean()
x_datediff = df['DateDiff'].unique()
x_datediff = sorted(x_datediff)
plt.errorbar(x = x_datediff, y = y_means)
plt.title('Appointments no show rate by number of days between schedule day and the appointment day')
plt.xlabel('Number of days between schedule day and the appointment day')
plt.ylabel('No show rate')
plt.show()
# looking more closely to the relevant part to see the pattern
y_means = df['NoShow'].groupby(df['DateDiff']).mean()
x_datediff = df['DateDiff'].unique()
x_datediff = sorted(x_datediff)
plt.errorbar(x = x_datediff, y = y_means)
plt.title('Appointments no show rate by number of days between schedule day and the appointment day')
plt.xlabel('Number of days between schedule day and the appointment day')
plt.ylabel('No show rate')
plt.xlim(0,50)
plt.ylim(0,0.45)
plt.show()
Looking at this detail, we can say that it seems to matter how long in advance the appointment is scheduled.
The increasing no show rate is visible in the first 20 days of difference. The longer the period between schedule and appointment day, the higher the no show rate in our data generally tends to be.
Further, knowing that there are almost 41k of 0 days difference values, it might make sense to look again only at a subset which excludes these values, in which we would expect generally higher no show rates than we saw until now.
Let's look at the main no show charts from Q1 and continue further with the shrinked data frame.
# filter 0 diff days out of the data frame
df2 = df.query('DateDiff > 0')
df2.shape
# overall no show rate of the shrinked data set is 28.8%
noshow_proportion = df2['NoShow'].sum() / len(df2.index)
noshow_proportion
# subplots with no show rates per characteristic
# the overall no show rate is shown by a horizontal line
plt.figure(figsize = [20, 9])
# plot details function to be used in the following bar charts
def plot_details2(plot):
plot.set(ylim=(0,0.4))
plot.axhline(noshow_proportion, color='black')
plt.ylim(g1.get_ylim())
plt.subplot(2, 3, 1)
g1 = bar_plot(df2['Alcoholism'], df2['NoShow'], 'Alcoholism', 'No show rate', '', 0, 'deep')
plot_details2(g1)
plt.subplot(2, 3, 2)
g2 = bar_plot(df2['Diabetes'], df2['NoShow'], 'Diabetes', 'No show rate', '', 0, 'deep')
plot_details2(g2)
plt.subplot(2, 3, 3)
g3 = bar_plot(df2['Hypertension'], df2['NoShow'], 'Hypertension', 'No show rate', '', 0, 'deep')
plot_details2(g3)
plt.subplot(2, 3, 4)
g4 = bar_plot(df2['Scholarship'], df2['NoShow'], 'Scholarship', 'No show rate', '', 0, 'deep')
plot_details2(g4)
plt.subplot(2, 3, 5)
g5 = bar_plot(df2['SMSReceived'], df2['NoShow'], 'SMS Received', 'No show rate', '', 0, 'deep')
plot_details2(g5)
plt.subplot(2, 3, 6)
g6 = bar_plot(df2['Gender'], df2['NoShow'], 'Gender', 'No show rate', '', 0, 'deep')
plot_details2(g6)
plt.show()
Some patterns changed now when we only look at appointments scheduled at least one day in advance.
Appointments of alcoholics are skipped more frequently than the average appointments are.
Receiving an SMS reminder now has a positive effect on no show rate, i.e. appointments with an SMS have lower no show rate than those without a reminder and in general.
Findings for other characteristics did not change.
# neighbourhoods still do not deviate widely from the overall no show rate
plt.figure(figsize=(35,10))
plt.xticks(rotation=90)
g = sns.barplot(x = df2['Neighbourhood'], y = df2['NoShow'], order = df['Neighbourhood'].value_counts().index, \
color=sns.color_palette()[0]);
g.set_title('Appointments no show rate by neighbourhood')
g.set(ylabel='No show rate', ylim=(0,0.5), xlabel='')
g.axhline(noshow_proportion)
plt.show()
# no show rate still decreases with age in the filtered data
y_means = df2['NoShow'].groupby(df2['Age']).mean()
x_age = df2['Age'].unique()
x_age = sorted(x_age)
plt.errorbar(x = x_age, y = y_means)
plt.title('No show rate by age')
plt.xlabel('Age')
plt.ylabel('No show rate')
plt.xlim(0,80)
plt.show()
# add appointment day of week to data frame
dow = df2['AppointmentDay'].dt.day_name()
df2 = df2.assign(AppointmentDayOfWeek = dow)
df2.head()
week_key = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.figure(figsize = [9, 10])
plt.subplot(2, 1, 1)
g1 = sns.barplot(x = df2['AppointmentDayOfWeek'], y = df2['NoShow'], order = week_key)
g1.set(xlabel='', ylabel='No show rate', ylim=(0,0.35), title='Appointments no show rate by day of week')
plt.subplot(2, 1, 2)
g2 = sns.countplot(data = df2, x = 'AppointmentDayOfWeek', order = week_key)
g2.set(xlabel='Appointment day of week', ylabel='Count of appointments', title='Appointments count by day of week')
plt.show()
No show rate on Fridays is a bit higher than on Tue, Wed, or Thu, but it is very comparable to no show rate on Mondays.
Mo, Tue, and Wed are the favourite days for appointments.
This topic was already covered in overall data exploration in Q1 on the full data set and in Q2 on the filtered data set.
Let's summarize:
# NoShow and SMS received correlation now has a negative sign, but it is quite small
df2.corr().style.background_gradient(cmap='Blues')
# subplot for comparison with the full data
plt.figure(figsize = (12, 4))
plt.subplot(1, 2, 1)
g1 = count_plot(df2, 'SMSReceived', 'SMS received', 'Count of appointments', 'Appointments count on filtered data', \
None, 'deep')
g1.set(ylim=(0, 80000))
plt.subplot(1, 2, 2)
g2 = count_plot(df, 'SMSReceived', 'SMS received', 'Count of appointments', 'Appointments count on full data')
plt.ylim(g1.get_ylim())
plt.show()
# 35482 appointments received an SMS
SMS = df2['SMSReceived'].value_counts()
SMS
# that makes around 51% of appointments with a received SMS
# this is much a larger share than on the full data set
SMS_proportion = df2['SMSReceived'].sum() / len(df2.index)
SMS_proportion
We filtered out data with day difference between scheduling and appointment of 0.
These filtered out appointments did not receive an SMS. It makes sense that patients do not get an SMS for appointments on the same day.
The filtered data therefore has much more comparable group sizes for appointments with and without an SMS reminder.
# subplots for comparison
plt.figure(figsize = [12, 4])
plt.subplot(1, 2, 1)
# no show rate for appointments with an SMS reminder is now lower than for appointments with no reminder
g1 = bar_plot(df2['SMSReceived'], df2['NoShow'], 'SMS received', 'No show rate', 'Appointments no show rate on filtered data', \
0, 'deep')
plt.subplot(1, 2, 2)
g2 = bar_plot(df['SMSReceived'], df['NoShow'], 'SMS received', 'No show rate', 'Appointments no show rate on full data', \
0)
plt.ylim(g1.get_ylim())
plt.show()
# no show rate for appointments with SMS received
noshow_share_SMS = len(df2.query('NoShow == True and SMSReceived == True')) / len(df2.query('SMSReceived == True'))
noshow_share_SMS
# no show rate for appointments with SMS received
show_share_SMS = len(df2.query('NoShow == True and SMSReceived == False')) / len(df2.query('SMSReceived == False'))
show_share_SMS
Appointments no show rate is lower for cases when SMS was sent (27.6%) than for those without a reminder (30%) and also lower than the overall no show rate of 28.8% (for the filtered data set).
While doing EDA, it became apparent that it would make sense to look only at data where scheduling and appointment did not happen on the same day. On this reduced data set, the main findings are the following:
Yes, there is an increasing trend of no show rate as the day difference between schedule and appointment day gets larger, especially within the first 20 days.
Not really. The no show frequency on Fridays is a little higher than average, but very comparable to the one on Mondays.
Yes, sending an SMS seems to be somewhat helpful in reducing no shows, but the correlation is very small.