# Predicting Risk of Readmission for Diabetic Patients¶

Hospital readmissions are among the challenge that hospitals are facing today. Most of the readmission are believed to be preventable. Predictive models can be used to assess the chance of readmission of a patient within a given period. If the risk of readmission is high, patients can receive extended care before discharge so their chances of readmission are reduced. In this post, I will develop a logistic regression model that predicts the risk of readmission of diabetic patients.

In [3]:
frame1 = read.csv('Diabetes_Data.csv')

In [4]:
str(frame1)

'data.frame':	30530 obs. of  45 variables:
$race : Factor w/ 6 levels "'unknown'","AfricanAmerican",..: 4 4 4 4 4 4 4 4 2 4 ...$ gender                    : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 1 1 1 2 2 1 ...
$age : Factor w/ 10 levels "[0-10)","[10-20)",..: 6 8 7 6 8 8 9 9 4 8 ...$ weight                    : Factor w/ 10 levels "'unknown'","[0-25)",..: 1 1 1 1 1 1 1 1 1 1 ...
$discharge_disposition_id : int 25 23 1 1 6 3 11 11 1 2 ...$ admission_source_id       : int  1 17 7 1 7 17 7 7 17 7 ...
$time_in_hospital : num 0 0.6154 0.0769 0.3077 0.3846 ...$ payer_code                : Factor w/ 17 levels "'unknown'","BC",..: 1 1 15 1 8 8 8 5 1 8 ...
$num_lab_procedures : num 0.2519 0.5496 0.0916 0.4275 0.3206 ...$ num_procedures            : num  1 1 0 0.5 1 ...
$number_outpatient : num 0 0 0 0 0 0 0 0 0 0 ...$ number_emergency          : num  0 0 0 0 0.0132 ...
$number_inpatient : num 0 0.0476 0.0476 0 0 ...$ diag_1                    : Factor w/ 19 levels "","blood","circulatory",..: 3 10 9 3 10 13 3 5 3 3 ...
$diag_2 : Factor w/ 19 levels "","blood","circulatory",..: 6 18 19 3 2 2 3 2 3 3 ...$ diag_3                    : Factor w/ 19 levels "","blood","circulatory",..: 6 3 6 6 8 2 16 16 6 3 ...
$number_diagnoses : num 0.333 0.533 0.533 0.2 0.533 ...$ max_glu_serum             : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 2 3 3 3 3 ...
$A1Cresult : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 4 2 3 ...$ metformin                 : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
$repaglinide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...$ nateglinide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
$chlorpropamide : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...$ glimepiride               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
$acetohexamide : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...$ glipizide                 : Factor w/ 4 levels "Down","No","Steady",..: 2 3 2 2 2 2 2 2 2 2 ...
$glyburide : Factor w/ 4 levels "Down","No","Steady",..: 3 2 2 2 2 2 2 2 2 2 ...$ tolbutamide               : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
$pioglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...$ rosiglitazone             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
$acarbose : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...$ miglitol                  : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
$troglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...$ tolazamide                : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
$insulin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 4 1 2 3 3 2 ...$ glyburide.metformin       : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
$glipizide.metformin : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...$ glimepiride.pioglitazone  : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
$metformin.rosiglitazone : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...$ metformin.pioglitazone    : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
$change : Factor w/ 2 levels "Ch","No": 2 2 2 2 1 1 2 2 2 2 ...$ diabetesMed               : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 2 1 2 2 2 ...
$admission_type_description: Factor w/ 6 levels "Elective","Emergency",..: 5 5 6 1 2 5 2 2 5 2 ...$ num_meds_class            : Factor w/ 2 levels "large","normal": 2 1 2 1 2 1 2 1 2 2 ...
$readmi_class : Factor w/ 2 levels "NO","YES": 1 2 1 1 1 1 1 1 2 1 ...  In [5]: dim(frame1)  Out[5]: 30530 45 This diabetes readmission data is partly cleaned, but still a huge dataset. It still required more cleaning and transformation. The data contains 44 features and a categorical response (readmi_class), where a readmitted patient is represented by 'NO' and a non-readmitted patient is represented by 'Yes'. The features contains both numeric and categorical types of data. In [6]: # The values for the following features are missing and need to be removed frame1$glipizide.metformin <-NULL
frame1$glimepiride.pioglitazone <-NULL frame1$metformin.rosiglitazone <-NULL
frame1$metformin.pioglitazone <-NULL  In [8]: # upload some of the libraries library(ggplot2) library(dplyr)  #### Explotory Data Analysis¶ Plot the features to visualize the distribution of the two classes of the response variable. In [9]: ## Compare outcomes for various levels of factor (categorical) features. bar.plot <- function(x){ if(is.factor(frame1[, x])){ sums <- summary(frame1[, x], counts = n()) msk <- names(sums[which(sums > 100)]) # if a feature has less than 100 observations, it will not be displayed tmp <- frame1[frame1[, x] %in% msk, c('readmi_class', x)] if(strsplit(x, '[-]')[[1]][1] == x){ g <- ggplot(tmp, aes_string(x)) + geom_bar() + facet_grid(. ~ readmi_class) + ggtitle(paste('Readmissions by level of', x)) print(g) } } } cols <- names(frame1) cols <- cols[1:(length(cols) - 1)] #lapply(cols, bar.plot)  For majority of the categorical features, the distributions of both classes are almost the same. In [76]: ## Make box plots of the numeric columns violin.plot <- function(x){ if(is.numeric(frame1[, x])){ ggplot(frame1, aes_string('readmi_class', x)) + geom_violin(alpha=0.1) + ggtitle(paste('Readmissions by', x)) } }  In [1]: #lapply(names(frame1), violin.plot)  I am not going to explain each feature here, but it seems that difference between the readmitted and the non-readmitted classes seems to be more obvious looking at violin plot for some of th numeric features. The shape of the violin plots for number of inpatient vistis and the length of stay of patient (time in hospital) are worth noting. In [131]: # The followin feature are almost identical for the to classes, and they can be removed frame1$race <- NULL
frame1$weight <- NULL frame1$payer_code <- NULL
frame1$diag_1 <- NULL frame1$diag_2 <- NULL
frame1$diag_3 <- NULL frame1$acetohexamide <-NULL
frame1$chlorpropamide <- NULL frame1$acarbose <- NULL
frame1$miglitol <- NULL  ##### The Model¶ Now, let's build the logistic regression classification model. First, I will divide the data into training and testing datasets. Size the size of the data is large, I will use 60% of the data for training. In [132]: library(caTools) split=sample.split(frame1$readmi_class,SplitRatio = 0.6)

train<-frame1[split==TRUE, ]
test<-frame1[split==FALSE, ]

In [133]:
glm_model <- glm(readmi_class ~., data=train, family="binomial")

In [134]:
summary(glm_model)

Out[134]:
Call:
glm(formula = readmi_class ~ ., family = "binomial", data = train)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-3.7822  -1.0421  -0.8266   1.2059   1.8808

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)                              1.061e+01  3.247e+02   0.033 0.973946
discharge_disposition_id                -1.272e-02  3.030e-03  -4.196 2.71e-05
time_in_hospital                         2.926e-01  7.863e-02   3.721 0.000198
num_lab_procedures                       1.504e-01  1.184e-01   1.270 0.204027
num_procedures                          -2.781e-01  5.951e-02  -4.672 2.98e-06
number_outpatient                        4.091e+00  6.037e-01   6.775 1.24e-11
number_emergency                         1.563e+01  2.210e+00   7.071 1.54e-12
number_inpatient                         7.333e+00  3.426e-01  21.404  < 2e-16
number_diagnoses                         1.318e+00  1.289e-01  10.224  < 2e-16
max_glu_serum>300                        3.350e-01  1.917e-01   1.747 0.080554
max_glu_serumNone                        4.107e-01  1.492e-01   2.753 0.005911
max_glu_serumNorm                        2.044e-01  1.653e-01   1.237 0.216135
A1Cresult>8                              3.216e-02  9.530e-02   0.337 0.735782
A1CresultNone                            4.713e-02  8.042e-02   0.586 0.557847
A1CresultNorm                           -1.866e-01  1.068e-01  -1.747 0.080609
metforminNo                              4.138e-01  1.916e-01   2.160 0.030769
metforminUp                              1.142e-01  2.435e-01   0.469 0.639146
repaglinideNo                            4.335e-02  6.481e-01   0.067 0.946677
repaglinideUp                           -5.899e-02  8.260e-01  -0.071 0.943068
nateglinideNo                           -1.275e+01  3.247e+02  -0.039 0.968687
nateglinideUp                           -1.324e+01  3.247e+02  -0.041 0.967475
glimepirideNo                            6.285e-01  4.319e-01   1.455 0.145612
glimepirideUp                            6.296e-01  5.110e-01   1.232 0.217863
glipizideNo                             -4.615e-01  2.142e-01  -2.155 0.031177
glipizideUp                             -4.676e-01  2.718e-01  -1.720 0.085370
glyburideNo                             -6.119e-02  2.091e-01  -0.293 0.769833
glyburideUp                              7.721e-02  2.591e-01   0.298 0.765651
pioglitazoneNo                          -5.002e-01  4.419e-01  -1.132 0.257631
pioglitazoneUp                           1.006e-01  5.222e-01   0.193 0.847212
rosiglitazoneNo                          7.171e-01  5.364e-01   1.337 0.181226
rosiglitazoneUp                          6.541e-01  6.423e-01   1.018 0.308532
tolazamideUp                             1.277e+01  3.247e+02   0.039 0.968634
insulinNo                               -9.712e-02  8.257e-02  -1.176 0.239529
insulinUp                               -6.856e-02  6.457e-02  -1.062 0.288306
glyburide.metforminNo                   -3.013e-01  1.448e+00  -0.208 0.835190
glyburide.metforminUp                   -1.331e+01  3.247e+02  -0.041 0.967318
changeNo                                 1.602e-02  5.939e-02   0.270 0.787368
diabetesMedYes                           2.764e-01  5.698e-02   4.852 1.22e-06
admission_type_descriptionTrauma Center -1.180e+01  1.607e+02  -0.073 0.941465
num_meds_classnormal                     2.057e-02  4.211e-02   0.489 0.625177

(Intercept)
discharge_disposition_id                ***
time_in_hospital                        ***
num_lab_procedures
num_procedures                          ***
number_outpatient                       ***
number_emergency                        ***
number_inpatient                        ***
number_diagnoses                        ***
max_glu_serum>300                       .
max_glu_serumNone                       **
max_glu_serumNorm
A1Cresult>8
A1CresultNone
A1CresultNorm                           .
metforminNo                             *
metforminUp
repaglinideNo
repaglinideUp
nateglinideNo
nateglinideUp
glimepirideNo
glimepirideUp
glipizideNo                             *
glipizideUp                             .
glyburideNo
glyburideUp
pioglitazoneNo
pioglitazoneUp
rosiglitazoneNo
rosiglitazoneUp
tolazamideUp
insulinNo
insulinUp
glyburide.metforminNo
glyburide.metforminUp
changeNo
diabetesMedYes                          ***
num_meds_classnormal
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 25282  on 18317  degrees of freedom
Residual deviance: 23851  on 18260  degrees of freedom
AIC: 23967

Number of Fisher Scoring iterations: 11


Note that looking at the p-values, many features are not statistically significant and can be removed without major effect on the predition accuracy.

##### Prediction on the testing dataset¶
In [135]:
test_predicted <- predict(glm_model, newdata = test, type = "response")


Now, lets plot the receiver operating characteristic (ROC) curve.

In [176]:
# First, install the ROCR package
install.packages("ROCR",repos='http://cran.us.r-project.org')

Installing package into 'C:/Users/Desta/Documents/R/win-library/3.1'
(as 'lib' is unspecified)
Warning message:
: package 'ROCR' is in use and will not be installed
In [177]:
#load the ROC package
library(ROCR)

In [178]:
ROCR_prediction <- prediction(test_predicted, test$readmi_class) ROCR_performace <- performance(ROCR_prediction, "tpr", "fpr") # Plot the ROC curve options(repr.plot.width = 10) options(repr.plot.height = 6) plot(ROCR_performace, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))  In [141]: # Calculate the area under the curve performance(ROCR_prediction, "auc")@y.values  Out[141]: 0.654 Given the messinessof readmission data, AUC of 0.65 is not bad. Let's the confusion matrix to see the prediction accuracy of the model for the readmitted patients. In [149]: # confusion matrix with cut-off value of 0.5 table(test$readmi_class, test_predicted>0.5)

Out[149]:
      FALSE TRUE
NO   5231 1353
YES  3352 2276
In [173]:
sensitivity = round((2276/(2276+3352)),2)

specificity = round((5231/(5231+1353)),2)

accuracy:  0.61
specifity:  0.79