dplyr Package Tutorial

dplyr package


The dplyr package is among the most important packages that make data analysis easier. This subsection use the dplyr package to analysis data.

In [4]:
frame <- read.csv("stevens.csv", stringsAsFactors= FALSE)
In [7]:
head(frame)
Out[7]:
DocketTermCircuitIssuePetitionerRespondentLowerCourtUnconstReverse
193-140819942ndEconomicActivityBUSINESSBUSINESSliberal01
293-157719949thEconomicActivityBUSINESSBUSINESSliberal01
393-161219945thEconomicActivityBUSINESSBUSINESSliberal01
494-62319941stEconomicActivityBUSINESSBUSINESSconser01
594-117519957thJudicialPowerBUSINESSBUSINESSconser01
695-12919959thEconomicActivityBUSINESSBUSINESSconser10
In [6]:
dim(frame)
Out[6]:

    566

    9

In [13]:
library(dplyr)

The five import functions in dplyr:

    Filter: keep rows matching criteria

    select: pick columns by name

    arrange: reorder rows

    mutate: add new variables

    summarize: reduces variables down to a single value

Structure

    First argument is a data frame

    Subsequent arguments say waht to do with the data frame

    Always return a data frame

Filter

In [18]:
# Filter
filter1 <- filter(frame, Respondent =="BUSINESS")
head(filter1)
Out[18]:
DocketTermCircuitIssuePetitionerRespondentLowerCourtUnconstReverse
193-140819942ndEconomicActivityBUSINESSBUSINESSliberal01
293-157719949thEconomicActivityBUSINESSBUSINESSliberal01
393-161219945thEconomicActivityBUSINESSBUSINESSliberal01
494-62319941stEconomicActivityBUSINESSBUSINESSconser01
594-117519957thJudicialPowerBUSINESSBUSINESSconser01
695-12919959thEconomicActivityBUSINESSBUSINESSconser10
In [21]:
# subset date where Term=1994 or 1997
filter2 <- filter(frame, Term %in% c(1994, 1997))
head(filter2)
Out[21]:
DocketTermCircuitIssuePetitionerRespondentLowerCourtUnconstReverse
193-140819942ndEconomicActivityBUSINESSBUSINESSliberal01
293-157719949thEconomicActivityBUSINESSBUSINESSliberal01
393-161219945thEconomicActivityBUSINESSBUSINESSliberal01
494-62319941stEconomicActivityBUSINESSBUSINESSconser01
596-176819979thEconomicActivityBUSINESSBUSINESSconser11
696-8431997DCEconomicActivityBUSINESSBUSINESSconser01
In [ ]:
In [23]:
# subset data with Term 1997 or later
filter3 <- filter(frame, Term >= 1997)
head(filter3)
Out[23]:
DocketTermCircuitIssuePetitionerRespondentLowerCourtUnconstReverse
196-176819979thEconomicActivityBUSINESSBUSINESSconser11
296-8431997DCEconomicActivityBUSINESSBUSINESSconser01
398-1480199911thCriminalProcedureBUSINESSBUSINESSconser01
499-15019992ndEconomicActivityBUSINESSBUSINESSconser01
599-157120006thEconomicActivityBUSINESSBUSINESSconser11
699-203520009thDueProcessBUSINESSBUSINESSliberal11
In [24]:
# Circuit is 9th and LowerCourt is liberal
filter4 <- filter(frame, Circuit=="9th" & LowerCourt=="liberal")
head(filter4)
Out[24]:
DocketTermCircuitIssuePetitionerRespondentLowerCourtUnconstReverse
193-157719949thEconomicActivityBUSINESSBUSINESSliberal01
299-203520009thDueProcessBUSINESSBUSINESSliberal11
395-24419959thJudicialPowerGOVERNMENT.OFFICIALBUSINESSliberal00
400-15220009thJudicialPowerGOVERNMENT.OFFICIALBUSINESSliberal01
593-131819949thFederalTaxationOTHERBUSINESSliberal01
695-8319959thJudicialPowerOTHERBUSINESSliberal01

Select

    starts_with(x, ignore.case = TRUE)

    ends_with(x, ignore.case = TRUE)

    contains(x, ...)

    matches(x, ...)

In [25]:
select1 <- select(frame, Term:Petitioner)
head(select1)
Out[25]:
TermCircuitIssuePetitioner
119942ndEconomicActivityBUSINESS
219949thEconomicActivityBUSINESS
319945thEconomicActivityBUSINESS
419941stEconomicActivityBUSINESS
519957thJudicialPowerBUSINESS
619959thEconomicActivityBUSINESS
In [27]:
select2 <- select(frame, starts_with("r"))
head(select2)
Out[27]:
RespondentReverse
1BUSINESS1
2BUSINESS1
3BUSINESS1
4BUSINESS1
5BUSINESS1
6BUSINESS0
In [28]:
# Get me the table whose variables ends with letter "t"
select3 <- select(frame, ends_with("t"))
head(select3)
Out[28]:
DocketCircuitRespondentLowerCourtUnconst
193-14082ndBUSINESSliberal0
293-15779thBUSINESSliberal0
393-16125thBUSINESSliberal0
494-6231stBUSINESSconser0
594-11757thBUSINESSconser0
695-1299thBUSINESSconser1
In [29]:
# Get me the table whose variables contains letter "t"
select4 <- select(frame, contains("t"))
head(select4)
Out[29]:
DocketTermCircuitPetitionerRespondentLowerCourtUnconst
193-140819942ndBUSINESSBUSINESSliberal0
293-157719949thBUSINESSBUSINESSliberal0
393-161219945thBUSINESSBUSINESSliberal0
494-62319941stBUSINESSBUSINESSconser0
594-117519957thBUSINESSBUSINESSconser0
695-12919959thBUSINESSBUSINESSconser1

Arrange

In [31]:
# Arrange the table in (alphabetically) descending order by Issue
arrange1 <- arrange(frame, Issue)
head(arrange1)
Out[31]:
DocketTermCircuitIssuePetitionerRespondentLowerCourtUnconstReverse
194-226199411thAttorneysBUSINESSOTHERliberal00
299-184820004thAttorneysBUSINESSOTHERconser01
399-50219996thAttorneysOTHEROTHERconser01
497-1943199810thCivilRightsEMPLOYEEBUSINESSconser01
500-185320012ndCivilRightsEMPLOYEEBUSINESSconser01
695-187219968thCivilRightsAMERICAN.INDIANBUSINESSconser00
In [37]:
# order the data by Term, descending order
arrange2 <- arrange(frame, desc(Term))
head(arrange2)
Out[37]:
DocketTermCircuitIssuePetitionerRespondentLowerCourtUnconstReverse
100-185320012ndCivilRightsEMPLOYEEBUSINESSconser01
200-130720014thCivilRightsOTHERBUSINESSconser01
300-92720015thUnionsOTHERBUSINESSconser01
400-124920017thFirstAmendmentBUSINESSCITYconser10
500-86020012ndJudicialPowerBUSINESSCRIMINAL.DEFENDENTliberal00
600-85320012ndDueProcessOTHERCRIMINAL.DEFENDENTliberal01
In [38]:
# order the data by Term, ascending order
arrange3 <- arrange(frame, Term)
head(arrange3)
Out[38]:
DocketTermCircuitIssuePetitionerRespondentLowerCourtUnconstReverse
193-140819942ndEconomicActivityBUSINESSBUSINESSliberal01
293-157719949thEconomicActivityBUSINESSBUSINESSliberal01
393-161219945thEconomicActivityBUSINESSBUSINESSliberal01
494-62319941stEconomicActivityBUSINESSBUSINESSconser01
594-78819947thEconomicActivityCITYBUSINESSliberal00
693-125119946thJudicialPowerOTHERBUSINESSliberal01

Mutate

Variable transformation

In [53]:
# Add additional variable called 'Reverse10' from Reverse
frame1 <- mutate(frame, Reverse10 = round(Reverse*runif(20)*100, 0))
head(frame1)
Warning message:
In c(1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, : longer object length is not a multiple of shorter object length
Out[53]:
DocketTermCircuitIssuePetitionerRespondentLowerCourtUnconstReverseReverse10
193-140819942ndEconomicActivityBUSINESSBUSINESSliberal0164
293-157719949thEconomicActivityBUSINESSBUSINESSliberal0139
393-161219945thEconomicActivityBUSINESSBUSINESSliberal0182
494-62319941stEconomicActivityBUSINESSBUSINESSconser0175
594-117519957thJudicialPowerBUSINESSBUSINESSconser0136
695-12919959thEconomicActivityBUSINESSBUSINESSconser100

Group_by and Summarise

Summary functions:

    min(x0, median(x), mean(), max(x), sum(x), quantile(x, p), n(), n_distinct(x), sd(x), var(x), IQR(x), mad(x)

    sum(x > 10), mean(x > 10)

In [55]:
# Group the data by Term and get sum of Reverse10 for each Term
by_Term <- group_by(frame1, Term)
summary1 <- summarise(by_Term, total = sum(Reverse10))
head(summary1)
Out[55]:
Termtotal
119942370
219952218
319962191
419972660
519981939
619992119
In [56]:
summary2 <- summarise(filter(by_Term, !is.na(Reverse10)),
                     med = median(Reverse10),
                     mean = mean(Reverse10),
                     max = max(Reverse10),
                     Q90 = quantile(Reverse10, 0.9))
head(summary2)
Out[56]:
TermmedmeanmaxQ90
119947.529.6259375.2
2199514.530.805569375
319961127.049389375
419973630.930239376
519987.525.513169372
619991129.845079382
In [ ]: