Text Mining and Sentiment Analysis Using R: Analysis of 2016 Presidential Candidates

This post analyzes the sentiment of tweets about the 2016 presidential front runners using R. CART and random forest classification models are also developed to classify tweets as Positive, Neutral, or Negative. Tweets containing the names of Bernie Sanders, Donald Trump, Hillary Clinton, and Ted Cruz were mined and analyzed.

Let's load the required packages.

In [1]:
library(twitteR)
library(ROAuth)
require(RCurl)
library(stringr)
library(tm)
library(plyr)
library(wordcloud)

In [2]:
consumer_key = "consumer key"
consumer_secret = "consumer secret"
token_secret = "token secret"
access_token = "access token"


# set working directory for the whole process-

setwd("C:/MOOC/Text Mining")


download.file(url="http://curl.haxx.se/ca/cacert.pem",
              destfile = "C:/MOOC/Text Mining/cacert.pem",
              method = "auto")



authenticate <- OAuthFactory$new(consumerKey = consumer_key,
                                 consumerSecret = consumer_secret,
                                 requestURL="https://api.twitter.com/oauth/request_token",
                                 accessURL="https://api.twitter.com/oauth/access_token",
                                 authURL="https://api.twitter.com/oauth/authorize")


setup_twitter_oauth(consumer_key, consumer_secret, access_token, token_secret)



save(authenticate, file="twitter authentication.Rdata")
[1] "Using direct authentication"

Web Scraping

Let's search what has been tweeted about the front runners from the Democratic and Republican parties. We will analyze who has been at the center of the tweets and the sentiments the tweets express.

In [4]:
Hillary <- searchTwitter("hillary + clinton", n=1000, lang='en', since='2015-10-01', until='2015-12-20')
Donald <- searchTwitter("donald + trump", n=1000, lang='en', since='2015-10-01', until='2015-12-20')
Bernie <- searchTwitter("bernie + sanders", n=1000, lang='en', since='2015-10-01', until='2015-12-20')
Ted <- searchTwitter("ted + cruz", n=1000, lang='en', since='2015-10-01', until='2015-12-20')
In [5]:
#Let's get text
hillary_txt <- sapply(Hillary, function(x) x$getText())
donald_txt <- sapply(Donald, function(x) x$getText())
bernie_txt <- sapply(Bernie, function(x) x$getText())
ted_txt <- sapply(Ted, function(x) x$getText())
In [6]:
#Number of tweets
NumTweets <- c(length(hillary_txt), length(donald_txt), length(bernie_txt), length(ted_txt))

Let's combine the tweets for each candidate.

In [7]:
tweets <- c(hillary_txt, donald_txt, bernie_txt, ted_txt)

#remove the non-alpha-numeric characters
tweets <- sapply(tweets,function(x) iconv(x, "latin1", "ASCII", sub=""))

Wordcloud

In [8]:
tweetCorpus <- Corpus(VectorSource(tweets))

#remove punctuation marks
tweetsCorpus <- tm_map(tweetCorpus, removePunctuation)

#convert text to lower case
tweetsCorpus <- tm_map(tweetsCorpus, tolower)

#remove stopwords
tweetsCorpus <- tm_map(tweetsCorpus, function(x) removeWords(x, stopwords("english")))

#remove white spaces
tweetsCorpus <- tm_map(tweetsCorpus, stripWhitespace)

#transform to text which wordcloud can use
tweet_dtm <- tm_map(tweetsCorpus, PlainTextDocument)

#making a document-term matrix
dtm <- DocumentTermMatrix(tweet_dtm)

dtm2 <- as.matrix(dtm)
freq <- colSums(dtm2)
freq <- sort(freq, decreasing = TRUE)
In [9]:
head(freq, 10)
Out[9]:
 sanders    trump   donald     cruz      ted  hillary  clinton   bernie campaign     data
    1092     1047      981      894      835      757      742      709      479      453

The names of the candidates are the most popular words. To focus on the contents of the tweets, let's remove the names of the candidates from appearing in the wordcloud.

In [10]:
tweetCorpus <- Corpus(VectorSource(tweets))

#remove punctuation marks
tweetsCorpus <- tm_map(tweetCorpus, removePunctuation)

#convert text to lower case
tweetsCorpus <- tm_map(tweetsCorpus, tolower)

#remove stopwords
tweetsCorpus <- tm_map(tweetsCorpus, removeWords, c("bernie", "sanders", "trump", "donald", "cruz", "ted", "hillary", "clinton", stopwords("english")))

#remove white spaces
tweetsCorpus <- tm_map(tweetsCorpus, stripWhitespace)

#transform to text which wordcloud can use
tweet_dtm <- tm_map(tweetsCorpus, PlainTextDocument)

#making a document-term matrix
dtm <- DocumentTermMatrix(tweet_dtm)

dtm2 <- as.matrix(dtm)
freq <- colSums(dtm2)
freq <- sort(freq, decreasing = TRUE)

Let's create a wordcloud.

In [22]:
#library(wordcloud)
pal = brewer.pal(8,"Dark2")
#pal = pal[-(1:4)]


words <- names(freq)
#wordcloud(words[1:50], freq[1:50], col = rainbow(100), random.order = F, scale = c(5,1))
wordcloud(words[1:50], freq[1:50], colors = pal)

It seems that Bernie Sanders was at the center of recent tweets. In particular, the data breach of the DNC database drew much of the attention.

Sentiment Analysis

Let's apply the lexicon-based sentiment analysis approach first proposed by Hu and Liu (KDD 2004). More details about the method can be found in their paper.

In [13]:
pos <- readLines("positive-words.txt")
neg <- readLines("negative-words.txt")
In [14]:
score.sentiment <- function(sentences, pos.words, neg.words, .progress = 'none'){

  scores <- laply(sentences,
                  function(sentence, pos.words, neg.words){

    #remove punctuation - using global substitute
    sentence <- gsub("[[:punct:]]", "", sentence)

    #remove control characters
    sentence <- gsub("[[:cntrl:]]", "", sentence)

    #remove digits
    sentence <- gsub('\\d+', '', sentence)

    #define error handling function when trying tolower
    tryTolower <- function(x){

      #create missing value
      y <- NA

      #tryCatch error
      try_error <- tryCatch(tolower(x), error = function(e) e)

      #if not an error
      if (!inherits(try_error, "error"))
        y <- tolower(x)

      #result
      return(y)
    }

    #use tryTolower with sapply
    sentence <- sapply(sentence, tryTolower)

    #split sentence into words with str_split (stringr package)
    word.list <- str_split(sentence, "\\s+")
    #word.list <- str_replace_all(word.list, "[^[:graph:]]", " ")
    words <- unlist(word.list)

    #remove non-alphabetic characters
    #alpha_num <- grep(words, pattern = "[a-z|0-9]", ignore.case = T)
    #words <- paste(words[alpha_num], collapse = " ")
    #words <- sapply(words, function(row) iconv(row, "latin1", "ASCII", sub = ""))

    #compare words to the dictionaries of positive & negative terms
    pos.matches <- match(words, pos.words)
    neg.matches <- match(words, neg.words)

    #match() gives the position of the matched term or NA;
    #we just want a TRUE/FALSE
    pos.matches <- !is.na(pos.matches)
    neg.matches <- !is.na(neg.matches)

    #final score
    score <- sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress = .progress)

  #data frame with scores for each sentence
  scores.df <- data.frame(text = sentences, score = scores)
  return(scores.df)
}
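Before scoring the full tweet set, it helps to sanity-check the scoring logic on a couple of toy sentences. The snippet below is a minimal, self-contained sketch of the per-sentence step using base R only; the `score_one` helper and the mini word lists are made up for illustration (the real analysis uses the full Hu & Liu opinion lexicons):

```r
# minimal standalone sketch of the per-sentence scoring step
score_one <- function(sentence, pos.words, neg.words) {
  # strip punctuation and digits, lower-case, split on whitespace
  sentence <- tolower(gsub("[[:punct:]]|\\d+", "", sentence))
  words <- unlist(strsplit(sentence, "\\s+"))
  # score = positive hits minus negative hits
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

# toy lexicons, for illustration only
pos.words <- c("good", "great", "win")
neg.words <- c("bad", "awful", "lose")

score_one("What a great, great debate!", pos.words, neg.words)    # 2
score_one("That speech was bad and awful.", pos.words, neg.words) # -2
```

Each repeated lexicon word counts once per occurrence, which is why the first sentence scores 2.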
In [15]:
#apply function score.sentiment
scores <- score.sentiment(tweets, pos, neg, .progress='text')
  |======================================================================| 100%
In [16]:
#add variables to a data frame
scores$Candidate = factor(rep(c("Hillary", "Donald", "Bernie", "Ted"), NumTweets))
scores$very.pos = as.numeric(scores$score >= 2)
scores$very.neg = as.numeric(scores$score <= -2)
In [17]:
#how many very positives and very negatives
numpos <- sum(scores$very.pos)
numneg <- sum(scores$very.neg)
#global score
global_score = paste0(round(100 * numpos / (numpos + numneg)),"%")
global_score
Out[17]:
"72%"

Generally speaking, the tweets have quite positive sentiments.

Now, let's compare which candidate has positive tweets and which has negative posts.

In [18]:
boxplot(score~Candidate, data=scores, col='blue')

Despite his recent controversial remarks, presidential candidate Donald Trump has the most positive comments, while the sentiment towards the Democrats is negative. The following histograms convey the same message.

In [137]:
library("lattice")
histogram(data=scores, ~score|Candidate, main="Sentiment Analysis of the four presidential candidates", xlab="", sub="Sentiment Score")

Summary

The names of Bernie Sanders and Donald Trump were the most common words in the 4,000 collected tweets. Generally speaking, the tweets conveyed positive sentiment, with a total score of 72%.

Looking at the individual presidential candidates, the sentiment towards the Republican front runners was positive, while that towards the Democratic candidates was negative. According to the sentiment analysis, tweets involving Donald Trump had the most positive sentiment.
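The CART and random forest classifiers mentioned in the introduction can be sketched as follows. This is a minimal illustration on synthetic data: in the actual analysis the predictors would come from the document-term matrix (`dtm2`) and the labels from binning `scores$score` into Negative/Neutral/Positive; the `rpart` and `randomForest` packages are assumed to be installed, and all variable names below are invented for the sketch.

```r
library(rpart)
library(randomForest)

set.seed(42)

# synthetic stand-in for the real data: 300 "tweets" with 5 term-count
# features; labels derived by binning a fake sentiment score
X <- as.data.frame(matrix(rpois(300 * 5, lambda = 1), nrow = 300))
fake_score <- X$V1 - X$V2 + rnorm(300)
y <- cut(fake_score, breaks = c(-Inf, -1, 1, Inf),
         labels = c("Negative", "Neutral", "Positive"))

train <- cbind(X, Sentiment = y)

# CART model
cart_model <- rpart(Sentiment ~ ., data = train, method = "class")

# random forest model
rf_model <- randomForest(Sentiment ~ ., data = train, ntree = 100)

# training-set confusion matrix for the forest
table(predicted = predict(rf_model, X), actual = y)
```

On the real tweets, the same pattern applies with `dtm2` as the feature matrix and the score-derived sentiment factor as the response.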