## Multi-class Classification in R¶

### Prediction of Claims¶

The cost of misclassifying a high-risk patient as low-risk is higher than the cost of predicting a low-risk patient as high-risk. In this section, claims data for a sample of people enrolled in the Medicare program is used to predict hospital costs. The data is publicly available from the US Centers for Medicare and Medicaid Services (CMS) to help develop models that predict hospital costs.

In [2]:
claims <- read.csv("ClaimsData.csv")

In [3]:
str(claims)

'data.frame':	458005 obs. of  16 variables:
 $ age              : int  85 59 67 52 67 68 75 70 67 67 ...
 $ alzheimers       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ arthritis        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cancer           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ copd             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ depression       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ diabetes         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ heart.failure    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ihd              : int  0 0 0 0 0 0 0 0 0 0 ...
 $ kidney           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ osteoporosis     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ stroke           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ reimbursement2008: int  0 0 0 0 0 0 0 0 0 0 ...
 $ bucket2008       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ reimbursement2009: int  0 0 0 0 0 0 0 0 0 0 ...
 $ bucket2009       : int  1 1 1 1 1 1 1 1 1 1 ...


The classification tree model is built using the data from 2008 and tested using the data from 2009.

In [8]:
library(caTools)
set.seed(88)
spl <- sample.split(claims$bucket2009, SplitRatio = 0.6)
claimsTrain <- subset(claims, spl == TRUE)
claimsTest <- subset(claims, spl == FALSE)

In [9]:
table(claimsTest$bucket2009, claimsTest$bucket2008)

Out[9]:
         1      2      3      4      5
  1 110138   7787   3427   1452    174
  2  16000  10721   4629   2931    559
  3   7006   4629   2774   1621    360
  4   2688   1943   1415   1539    352
  5    293    191    160    309    104

The baseline method simply predicts that each patient falls into the same cost bucket in 2009 as in 2008. Its accuracy on the test set is:

In [12]:
Accuracy_baseline <- diag(table(claimsTest$bucket2009, claimsTest$bucket2008))/sum(table(claimsTest$bucket2009, claimsTest$bucket2008))
round(sum(Accuracy_baseline), 3)

Out[12]:
0.684

Misclassifying a high-risk patient as low-risk is costlier than predicting a low-risk patient as high-risk. Let's introduce a penalty matrix that assigns higher penalties to underestimating a patient's risk (rows are actual buckets, columns are predicted buckets):

In [15]:
penaltyMatrix <- matrix(c(0, 1, 2, 3, 4,
                          2, 0, 1, 2, 3,
                          4, 2, 0, 1, 2,
                          6, 4, 2, 0, 1,
                          8, 6, 4, 2, 0), byrow = TRUE, nrow = 5)
penaltyMatrix

Out[15]:
0 1 2 3 4
2 0 1 2 3
4 2 0 1 2
6 4 2 0 1
8 6 4 2 0

The penalty error of the baseline model is calculated by multiplying the confusion matrix element-wise by the penalty matrix:

In [19]:
ErrorCost_matrix <- as.matrix(table(claimsTest$bucket2009, claimsTest$bucket2008)) * penaltyMatrix
ErrorCost_matrix

Out[19]:
        1     2     3     4     5
  1     0  7787  6854  4356   696
  2 32000     0  4629  5862  1677
  3 28024  9258     0  1621   720
  4 16128  7772  2830     0   352
  5  2344  1146   640   618     0

In [26]:
PenaltyError_baseline <- sum(ErrorCost_matrix)/nrow(claimsTest)
round(PenaltyError_baseline, 3)

Out[26]:
0.739

The goal of the classification model is therefore to improve on the baseline accuracy of 0.684 and to reduce the penalty error below 0.739.

In [27]:
library(rpart)
#library(rpart.plot)

In [36]:
claimsTree <- rpart(bucket2009 ~ . - reimbursement2009, data = claimsTrain, method = "class", cp = 0.00005, parms = list(loss = penaltyMatrix))
#prp(claimsTree)

In [37]:
predictTest <- predict(claimsTree, newdata = claimsTest, type = "class")
table(claimsTest$bucket2009, predictTest)

Out[37]:
   predictTest
         1     2     3     4     5
  1  94310 25295  3087   286     0
  2   7176 18942  8079   643     0
  3   3590  7706  4692   401     1
  4   1304  3193  2803   636     1
  5    135   356   408   156     2
In [38]:
AccuracyTest <- diag(table(claimsTest$bucket2009, predictTest))/sum(table(claimsTest$bucket2009, predictTest))
round(sum(AccuracyTest),3)

Out[38]:
0.647
In [39]:
# penalty error of the model
ErrorCost_rate <- sum(as.matrix(table(claimsTest\$bucket2009, predictTest))*penaltyMatrix)/nrow(claimsTest)
round(ErrorCost_rate,3)

Out[39]:
0.642
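The two metrics are computed the same way for the baseline and the tree, so they can be wrapped in a small helper (the function name and structure are my own, not from the notebook). Using `factor()` with explicit levels keeps the confusion matrix 5x5 even if some bucket is never predicted, so it stays conformable with the penalty matrix:

```r
# Sketch of a reusable scorer for this problem (assumed helper, not part
# of the original analysis). Returns overall accuracy and penalty error.
score <- function(actual, predicted, penalty) {
  cm <- as.matrix(table(factor(actual, levels = 1:5),
                        factor(predicted, levels = 1:5)))
  c(accuracy      = sum(diag(cm)) / sum(cm),
    penalty_error = sum(cm * penalty) / length(actual))
}

# Usage on the notebook's objects would be:
# score(claimsTest$bucket2009, predictTest, penaltyMatrix)
```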

The tree's overall accuracy (0.647) is lower than the baseline's (0.684), but its penalty error (0.642) is well below the baseline's (0.739): the cost-sensitive model makes fewer of the expensive mistakes, namely predicting high-risk patients as low-risk.
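Why does the loss matrix shift predictions toward higher buckets? For a single patient with class probabilities `p`, a cost-sensitive classifier picks the bucket that minimizes the expected penalty rather than the most probable bucket. The probabilities below are hypothetical, chosen only to illustrate the effect (rpart actually applies the loss during tree construction, but the arithmetic is the same idea):

```r
# Illustrative only: hedged example with made-up bucket probabilities.
L <- matrix(c(0, 1, 2, 3, 4,
              2, 0, 1, 2, 3,
              4, 2, 0, 1, 2,
              6, 4, 2, 0, 1,
              8, 6, 4, 2, 0), byrow = TRUE, nrow = 5)
p <- c(0.45, 0.30, 0.15, 0.07, 0.03)  # hypothetical probabilities for one patient

expected_cost <- as.vector(p %*% L)   # expected penalty of predicting each bucket

which.max(p)             # plain prediction: most probable bucket (1)
which.min(expected_cost) # cost-sensitive prediction: bucket 2
```

Even though bucket 1 is the most probable outcome, predicting it carries a large expected penalty because underestimating risk is punished heavily, so the cost-sensitive choice moves up to bucket 2.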