Multi-class Classification in R

Prediction of Claims

The cost of misclassifying a high-risk patient as low-risk is always higher than the cost of predicting a low-risk patient as high-risk. This notebook uses claims data for a sample of people in the Medicare program to predict hospital costs. The data is made publicly available by the US Centers for Medicare and Medicaid Services (CMS) to support the development of models that predict hospital costs.

In [2]:
claims <- read.csv("ClaimsData.csv")
In [3]:
str(claims)
'data.frame':	458005 obs. of  16 variables:
 $ age              : int  85 59 67 52 67 68 75 70 67 67 ...
 $ alzheimers       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ arthritis        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cancer           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ copd             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ depression       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ diabetes         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ heart.failure    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ihd              : int  0 0 0 0 0 0 0 0 0 0 ...
 $ kidney           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ osteoporosis     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ stroke           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ reimbursement2008: int  0 0 0 0 0 0 0 0 0 0 ...
 $ bucket2008       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ reimbursement2009: int  0 0 0 0 0 0 0 0 0 0 ...
 $ bucket2009       : int  1 1 1 1 1 1 1 1 1 1 ...

The classification tree model is built using the data from 2008 and tested using the data from 2009.

In [8]:
library(caTools)
set.seed(88)
spl <- sample.split(claims$bucket2009, SplitRatio = 0.6)
claimsTrain <- subset(claims, spl==TRUE)
claimsTest <- subset(claims, spl==FALSE)
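Note that `sample.split` performs a stratified split: the distribution of the outcome variable is (approximately) preserved in both pieces, which matters here because the cost buckets are heavily imbalanced. A minimal self-contained sketch on synthetic labels (the class proportions below are made up for illustration):

```r
library(caTools)

set.seed(88)
# synthetic outcome with imbalanced classes, like the cost buckets
y <- sample(1:5, 1000, replace = TRUE, prob = c(0.6, 0.2, 0.1, 0.07, 0.03))

spl <- sample.split(y, SplitRatio = 0.6)

# class proportions are approximately equal in the two pieces
round(prop.table(table(y[spl])), 2)
round(prop.table(table(y[!spl])), 2)
```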
In [9]:
table(claimsTest$bucket2009, claimsTest$bucket2008)
Out[9]:
   
         1      2      3      4      5
  1 110138   7787   3427   1452    174
  2  16000  10721   4629   2931    559
  3   7006   4629   2774   1621    360
  4   2688   1943   1415   1539    352
  5    293    191    160    309    104

The baseline method assumes each patient's cost bucket in 2009 is the same as in 2008. Its accuracy on the test set is computed as follows:

In [12]:
Accuracy_baseline <- diag(table(claimsTest$bucket2009, claimsTest$bucket2008))/sum(table(claimsTest$bucket2009, claimsTest$bucket2008))
round(sum(Accuracy_baseline), 3)
Out[12]:
0.684
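The accuracy computed above is the sum of the confusion matrix's diagonal (correct predictions) divided by its total. A tiny self-contained illustration of the same `diag`/`sum` idiom on made-up vectors:

```r
actual    <- c(1, 1, 2, 2, 3)
predicted <- c(1, 2, 2, 2, 3)

# rows = actual class, columns = predicted class
conf <- table(actual, predicted)

# correct predictions sit on the diagonal
sum(diag(conf)) / sum(conf)  # 0.8: four of the five predictions are correct
```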

The cost of misclassifying a high-risk patient as low-risk is always higher than the cost of predicting a low-risk patient as high-risk. Let's introduce a penalty matrix that assigns higher penalties to underestimating a patient's risk.

In [15]:
penaltyMatrix <- matrix(c(0, 1, 2, 3, 4,
                          2, 0, 1, 2, 3,
                          4, 2, 0, 1, 2,
                          6, 4, 2, 0, 1,
                          8, 6, 4, 2, 0), byrow=TRUE, nrow=5)
penaltyMatrix
Out[15]:
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    2    3    4
[2,]    2    0    1    2    3
[3,]    4    2    0    1    2
[4,]    6    4    2    0    1
[5,]    8    6    4    2    0

The penalty error of the baseline model is then calculated by:

In [19]:
ErrorCost_matrix <- as.matrix(table(claimsTest$bucket2009, claimsTest$bucket2008))*penaltyMatrix
ErrorCost_matrix
Out[19]:
   
        1     2     3     4     5
  1     0  7787  6854  4356   696
  2 32000     0  4629  5862  1677
  3 28024  9258     0  1621   720
  4 16128  7772  2830     0   352
  5  2344  1146   640   618     0
In [26]:
PenaltyError_baseline <- sum(ErrorCost_matrix)/nrow(claimsTest)
round(PenaltyError_baseline, 3)
Out[26]:
0.739
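The calculation above is the element-wise product of the confusion matrix (rows = actual bucket, columns = predicted bucket) with the penalty matrix, summed and divided by the number of observations. A tiny self-contained example with made-up actual/predicted vectors:

```r
penalty <- matrix(c(0, 1, 2, 3, 4,
                    2, 0, 1, 2, 3,
                    4, 2, 0, 1, 2,
                    6, 4, 2, 0, 1,
                    8, 6, 4, 2, 0), byrow = TRUE, nrow = 5)

actual    <- c(1, 1, 2, 5, 3)
predicted <- c(1, 2, 2, 1, 3)

# factor() with explicit levels keeps the confusion matrix 5x5
# even when some buckets are absent from these short vectors
conf <- table(factor(actual, levels = 1:5), factor(predicted, levels = 1:5))

# one mild error (actual 1, predicted 2: penalty 1) and one severe
# error (actual 5, predicted 1: penalty 8) give (1 + 8) / 5
sum(as.matrix(conf) * penalty) / length(actual)  # 1.8
```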

Now, the goal of the classification model is to improve on the baseline accuracy of 0.684 while reducing the penalty error below 0.739.

In [27]:
library(rpart)
#library(rpart.plot)
In [36]:
claimsTree <- rpart(bucket2009~. -reimbursement2009, data=claimsTrain, method = "class", 
                    cp=0.00005, parms=list(loss=penaltyMatrix))
#prp(claimsTree)
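The `parms = list(loss = ...)` argument is what makes `rpart` cost-sensitive: the tree's splits and class assignments minimize expected loss rather than raw misclassification count. A minimal self-contained sketch on the built-in `iris` data (the 3x3 penalties here are hypothetical, chosen only to illustrate the mechanism):

```r
library(rpart)

# hypothetical penalties: mistaking virginica (row 3) for setosa
# (column 1) costs five times as much as any other error
loss <- matrix(c(0, 1, 1,
                 1, 0, 1,
                 5, 1, 0), byrow = TRUE, nrow = 3)

fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(loss = loss))

# confusion matrix on the training data
table(iris$Species, predict(fit, type = "class"))
```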
In [37]:
predictTest <- predict(claimsTree, newdata=claimsTest, type="class")
table(claimsTest$bucket2009, predictTest)
Out[37]:
   predictTest
        1     2     3     4     5
  1 94310 25295  3087   286     0
  2  7176 18942  8079   643     0
  3  3590  7706  4692   401     1
  4  1304  3193  2803   636     1
  5   135   356   408   156     2
In [38]:
AccuracyTest <- diag(table(claimsTest$bucket2009, predictTest))/sum(table(claimsTest$bucket2009, predictTest))
round(sum(AccuracyTest),3)
Out[38]:
0.647
In [39]:
# penalty error of the model
ErrorCost_rate <- sum(as.matrix(table(claimsTest$bucket2009, predictTest))*penaltyMatrix)/nrow(claimsTest)
round(ErrorCost_rate,3)
Out[39]:
0.642

The tree's accuracy (0.647) is lower than the baseline's (0.684), but its penalty error (0.642) is well below the baseline's (0.739): the model makes fewer of the costly mistakes, namely predicting high-risk patients as low-risk.