Hospital Readmission

Predicting Risk of Readmission for Diabetic Patients


Hospital readmissions are among the challenge that hospitals are facing today. Most of the readmission are believed to be preventable. Predictive models can be used to assess the chance of readmission of a patient within a given period. If the risk of readmission is high, patients can receive extended care before discharge so their chances of readmission are reduced. In this post, I will develop a logistic regression model that predicts the risk of readmission of diabetic patients.


In [3]:
frame1 = read.csv('Diabetes_Data.csv')
In [4]:
str(frame1)
'data.frame':	30530 obs. of  45 variables:
 $ race                      : Factor w/ 6 levels "'unknown'","AfricanAmerican",..: 4 4 4 4 4 4 4 4 2 4 ...
 $ gender                    : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 1 1 2 2 1 ...
 $ age                       : Factor w/ 10 levels "[0-10)","[10-20)",..: 6 8 7 6 8 8 9 9 4 8 ...
 $ weight                    : Factor w/ 10 levels "'unknown'","[0-25)",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ discharge_disposition_id  : int  25 23 1 1 6 3 11 11 1 2 ...
 $ admission_source_id       : int  1 17 7 1 7 17 7 7 17 7 ...
 $ time_in_hospital          : num  0 0.6154 0.0769 0.3077 0.3846 ...
 $ payer_code                : Factor w/ 17 levels "'unknown'","BC",..: 1 1 15 1 8 8 8 5 1 8 ...
 $ num_lab_procedures        : num  0.2519 0.5496 0.0916 0.4275 0.3206 ...
 $ num_procedures            : num  1 1 0 0.5 1 ...
 $ number_outpatient         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ number_emergency          : num  0 0 0 0 0.0132 ...
 $ number_inpatient          : num  0 0.0476 0.0476 0 0 ...
 $ diag_1                    : Factor w/ 19 levels "","blood","circulatory",..: 3 10 9 3 10 13 3 5 3 3 ...
 $ diag_2                    : Factor w/ 19 levels "","blood","circulatory",..: 6 18 19 3 2 2 3 2 3 3 ...
 $ diag_3                    : Factor w/ 19 levels "","blood","circulatory",..: 6 3 6 6 8 2 16 16 6 3 ...
 $ number_diagnoses          : num  0.333 0.533 0.533 0.2 0.533 ...
 $ max_glu_serum             : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 2 3 3 3 3 ...
 $ A1Cresult                 : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 4 2 3 ...
 $ metformin                 : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
 $ repaglinide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ nateglinide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ chlorpropamide            : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ glimepiride               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ acetohexamide             : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
 $ glipizide                 : Factor w/ 4 levels "Down","No","Steady",..: 2 3 2 2 2 2 2 2 2 2 ...
 $ glyburide                 : Factor w/ 4 levels "Down","No","Steady",..: 3 2 2 2 2 2 2 2 2 2 ...
 $ tolbutamide               : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
 $ pioglitazone              : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ rosiglitazone             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ acarbose                  : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ miglitol                  : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ troglitazone              : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
 $ tolazamide                : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ insulin                   : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 4 1 2 3 3 2 ...
 $ glyburide.metformin       : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ glipizide.metformin       : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
 $ glimepiride.pioglitazone  : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
 $ metformin.rosiglitazone   : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
 $ metformin.pioglitazone    : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
 $ change                    : Factor w/ 2 levels "Ch","No": 2 2 2 2 1 1 2 2 2 2 ...
 $ diabetesMed               : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 2 1 2 2 2 ...
 $ admission_type_description: Factor w/ 6 levels "Elective","Emergency",..: 5 5 6 1 2 5 2 2 5 2 ...
 $ num_meds_class            : Factor w/ 2 levels "large","normal": 2 1 2 1 2 1 2 1 2 2 ...
 $ readmi_class              : Factor w/ 2 levels "NO","YES": 1 2 1 1 1 1 1 1 2 1 ...
In [5]:
dim(frame1)
Out[5]:

    30530

    45

This diabetes readmission data is partly cleaned, but still a huge dataset. It still required more cleaning and transformation. The data contains 44 features and a categorical response (readmi_class), where a readmitted patient is represented by 'NO' and a non-readmitted patient is represented by 'Yes'. The features contains both numeric and categorical types of data.

In [6]:
# The values for the following features are missing and need to be removed
frame1$glipizide.metformin <-NULL
frame1$glimepiride.pioglitazone <-NULL
frame1$metformin.rosiglitazone <-NULL
frame1$metformin.pioglitazone <-NULL
In [8]:
# upload some of the libraries
library(ggplot2)
library(dplyr)

Explotory Data Analysis

Plot the features to visualize the distribution of the two classes of the response variable.

In [9]:
## Compare outcomes for various levels of factor (categorical) features. 
bar.plot <- function(x){
  if(is.factor(frame1[, x])){
    sums <- summary(frame1[, x], counts = n())
    msk <- names(sums[which(sums > 100)]) # if a feature has less than 100 observations, it will not be displayed
    tmp <- frame1[frame1[, x] %in% msk, c('readmi_class', x)]
    if(strsplit(x, '[-]')[[1]][1] == x){
      g <- ggplot(tmp, aes_string(x)) +
        geom_bar() +
        facet_grid(. ~ readmi_class) +
        ggtitle(paste('Readmissions by level of', x))
      print(g)
    }
  }
}
cols <- names(frame1)
cols <- cols[1:(length(cols) - 1)]
#lapply(cols, bar.plot)

For majority of the categorical features, the distributions of both classes are almost the same.

In [76]:
## Make box plots of the numeric columns
violin.plot <- function(x){
  if(is.numeric(frame1[, x])){
    ggplot(frame1, aes_string('readmi_class', x)) +
      geom_violin(alpha=0.1) +
      ggtitle(paste('Readmissions by', x))
  }
}
In [1]:
#lapply(names(frame1), violin.plot)

I am not going to explain each feature here, but it seems that difference between the readmitted and the non-readmitted classes seems to be more obvious looking at violin plot for some of th numeric features. The shape of the violin plots for number of inpatient vistis and the length of stay of patient (time in hospital) are worth noting.

In [131]:
# The followin feature are almost identical for the to classes, and they can be removed
frame1$race <- NULL
frame1$weight <- NULL
frame1$payer_code <- NULL
frame1$diag_1 <- NULL
frame1$diag_2 <- NULL
frame1$diag_3 <- NULL
frame1$acetohexamide <-NULL
frame1$chlorpropamide <- NULL
frame1$acarbose <- NULL
frame1$miglitol <- NULL
The Model

Now, let's build the logistic regression classification model. First, I will divide the data into training and testing datasets. Size the size of the data is large, I will use 60% of the data for training.

In [132]:
library(caTools)

split=sample.split(frame1$readmi_class,SplitRatio = 0.6)

train<-frame1[split==TRUE, ]
test<-frame1[split==FALSE, ]
In [133]:
glm_model <- glm(readmi_class ~., data=train, family="binomial")
In [134]:
summary(glm_model)
Out[134]:
Call:
glm(formula = readmi_class ~ ., family = "binomial", data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.7822  -1.0421  -0.8266   1.2059   1.8808

Coefficients:
                                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                              1.061e+01  3.247e+02   0.033 0.973946
discharge_disposition_id                -1.272e-02  3.030e-03  -4.196 2.71e-05
admission_source_id                     -8.117e-04  5.516e-03  -0.147 0.883010
time_in_hospital                         2.926e-01  7.863e-02   3.721 0.000198
num_lab_procedures                       1.504e-01  1.184e-01   1.270 0.204027
num_procedures                          -2.781e-01  5.951e-02  -4.672 2.98e-06
number_outpatient                        4.091e+00  6.037e-01   6.775 1.24e-11
number_emergency                         1.563e+01  2.210e+00   7.071 1.54e-12
number_inpatient                         7.333e+00  3.426e-01  21.404  < 2e-16
number_diagnoses                         1.318e+00  1.289e-01  10.224  < 2e-16
max_glu_serum>300                        3.350e-01  1.917e-01   1.747 0.080554
max_glu_serumNone                        4.107e-01  1.492e-01   2.753 0.005911
max_glu_serumNorm                        2.044e-01  1.653e-01   1.237 0.216135
A1Cresult>8                              3.216e-02  9.530e-02   0.337 0.735782
A1CresultNone                            4.713e-02  8.042e-02   0.586 0.557847
A1CresultNorm                           -1.866e-01  1.068e-01  -1.747 0.080609
metforminNo                              4.138e-01  1.916e-01   2.160 0.030769
metforminSteady                          3.214e-01  1.913e-01   1.680 0.092950
metforminUp                              1.142e-01  2.435e-01   0.469 0.639146
repaglinideNo                            4.335e-02  6.481e-01   0.067 0.946677
repaglinideSteady                        2.760e-01  6.602e-01   0.418 0.675903
repaglinideUp                           -5.899e-02  8.260e-01  -0.071 0.943068
nateglinideNo                           -1.275e+01  3.247e+02  -0.039 0.968687
nateglinideSteady                       -1.249e+01  3.247e+02  -0.038 0.969322
nateglinideUp                           -1.324e+01  3.247e+02  -0.041 0.967475
glimepirideNo                            6.285e-01  4.319e-01   1.455 0.145612
glimepirideSteady                        6.467e-01  4.356e-01   1.485 0.137631
glimepirideUp                            6.296e-01  5.110e-01   1.232 0.217863
glipizideNo                             -4.615e-01  2.142e-01  -2.155 0.031177
glipizideSteady                         -3.417e-01  2.147e-01  -1.592 0.111478
glipizideUp                             -4.676e-01  2.718e-01  -1.720 0.085370
glyburideNo                             -6.119e-02  2.091e-01  -0.293 0.769833
glyburideSteady                          1.814e-02  2.096e-01   0.087 0.931029
glyburideUp                              7.721e-02  2.591e-01   0.298 0.765651
tolbutamideSteady                       -1.305e+00  1.138e+00  -1.147 0.251299
pioglitazoneNo                          -5.002e-01  4.419e-01  -1.132 0.257631
pioglitazoneSteady                      -4.451e-01  4.447e-01  -1.001 0.316797
pioglitazoneUp                           1.006e-01  5.222e-01   0.193 0.847212
rosiglitazoneNo                          7.171e-01  5.364e-01   1.337 0.181226
rosiglitazoneSteady                      8.388e-01  5.389e-01   1.557 0.119585
rosiglitazoneUp                          6.541e-01  6.423e-01   1.018 0.308532
troglitazoneSteady                       1.271e+01  3.247e+02   0.039 0.968791
tolazamideSteady                        -2.733e-02  7.386e-01  -0.037 0.970489
tolazamideUp                             1.277e+01  3.247e+02   0.039 0.968634
insulinNo                               -9.712e-02  8.257e-02  -1.176 0.239529
insulinSteady                           -2.450e-01  6.328e-02  -3.872 0.000108
insulinUp                               -6.856e-02  6.457e-02  -1.062 0.288306
glyburide.metforminNo                   -3.013e-01  1.448e+00  -0.208 0.835190
glyburide.metforminSteady               -1.003e-01  1.459e+00  -0.069 0.945169
glyburide.metforminUp                   -1.331e+01  3.247e+02  -0.041 0.967318
changeNo                                 1.602e-02  5.939e-02   0.270 0.787368
diabetesMedYes                           2.764e-01  5.698e-02   4.852 1.22e-06
admission_type_descriptionEmergency      6.649e-02  5.334e-02   1.247 0.212525
admission_type_descriptionNewborn       -1.176e+01  3.247e+02  -0.036 0.971118
admission_type_descriptionTrauma Center -1.180e+01  1.607e+02  -0.073 0.941465
admission_type_descriptionunknown        3.211e-01  8.072e-02   3.977 6.97e-05
admission_type_descriptionUrgent         8.080e-02  5.333e-02   1.515 0.129745
num_meds_classnormal                     2.057e-02  4.211e-02   0.489 0.625177

(Intercept)
discharge_disposition_id                ***
admission_source_id
time_in_hospital                        ***
num_lab_procedures
num_procedures                          ***
number_outpatient                       ***
number_emergency                        ***
number_inpatient                        ***
number_diagnoses                        ***
max_glu_serum>300                       .
max_glu_serumNone                       **
max_glu_serumNorm
A1Cresult>8
A1CresultNone
A1CresultNorm                           .
metforminNo                             *
metforminSteady                         .
metforminUp
repaglinideNo
repaglinideSteady
repaglinideUp
nateglinideNo
nateglinideSteady
nateglinideUp
glimepirideNo
glimepirideSteady
glimepirideUp
glipizideNo                             *
glipizideSteady
glipizideUp                             .
glyburideNo
glyburideSteady
glyburideUp
tolbutamideSteady
pioglitazoneNo
pioglitazoneSteady
pioglitazoneUp
rosiglitazoneNo
rosiglitazoneSteady
rosiglitazoneUp
troglitazoneSteady
tolazamideSteady
tolazamideUp
insulinNo
insulinSteady                           ***
insulinUp
glyburide.metforminNo
glyburide.metforminSteady
glyburide.metforminUp
changeNo
diabetesMedYes                          ***
admission_type_descriptionEmergency
admission_type_descriptionNewborn
admission_type_descriptionTrauma Center
admission_type_descriptionunknown       ***
admission_type_descriptionUrgent
num_meds_classnormal
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25282  on 18317  degrees of freedom
Residual deviance: 23851  on 18260  degrees of freedom
AIC: 23967

Number of Fisher Scoring iterations: 11

Note that looking at the p-values, many features are not statistically significant and can be removed without major effect on the predition accuracy.

Prediction on the testing dataset
In [135]:
test_predicted <- predict(glm_model, newdata = test, type = "response")

Now, lets plot the receiver operating characteristic (ROC) curve.

In [176]:
# First, install the ROCR package
 install.packages("ROCR",repos='http://cran.us.r-project.org')
Installing package into 'C:/Users/Desta/Documents/R/win-library/3.1'
(as 'lib' is unspecified)
Warning message:
: package 'ROCR' is in use and will not be installed
In [177]:
#load the ROC package
library(ROCR)
In [178]:
ROCR_prediction <- prediction(test_predicted, test$readmi_class)
ROCR_performace <- performance(ROCR_prediction, "tpr", "fpr")

# Plot the ROC curve

options(repr.plot.width = 10)
options(repr.plot.height = 6)
plot(ROCR_performace, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
In [141]:
# Calculate the area under the curve
performance(ROCR_prediction, "auc")@y.values
Out[141]:

    0.654

Given the messinessof readmission data, AUC of 0.65 is not bad. Let's the confusion matrix to see the prediction accuracy of the model for the readmitted patients.

In [149]:
# confusion matrix with cut-off value of 0.5
table(test$readmi_class, test_predicted>0.5)
Out[149]:
      FALSE TRUE
  NO   5231 1353
  YES  3352 2276
In [173]:
sensitivity = round((2276/(2276+3352)),2)

specificity = round((5231/(5231+1353)),2)
accuracy = round((5231+2276)/sum(table(test$readmi_class, test_predicted>0.5)),2)

cat('accuracy: ', accuracy, '\n')
cat("sensitivity: ", sensitivity, '\n')
cat("specifity: ", specificity)
accuracy:  0.61
sensitivity:  0.4
specifity:  0.79