Predicting the Sentiment of Tweets

Sentiment Analysis and Predicting the Sentiment of Tweets Using R: Analysis of 2016 Presidential Candidates

This post analyzes sentiment of tweets towards the 2016 presidential candidate front runners using R. Tweets containing the names of Bernie Sanders, Donald Trump, Hillary Clinton, and Ted Cruz were mined and analyzed.

Let's load the required packages.

In [1]:
library(twitteR)
library(ROAuth)
require(RCurl)
library(stringr)
library(tm)
library(plyr)
library(tm)
library(wordcloud)
Loading required package: RCurl
Loading required package: bitops
Loading required package: NLP

Attaching package: 'plyr'

The following object is masked from 'package:twitteR':

    id

Loading required package: RColorBrewer
In [2]:
consumer_key = "consumer_key"
consumer_secret = "consumer_secret"
token_secret = "token_secret"
access_token = "access_token"


# set working directory for the whole process-

setwd("C:/MOOC/Text Mining")


download.file(url="http://curl.haxx.se/ca/cacert.pem",
              destfile = "C:/MOOC/Text Mining/cacert.pem",
              method = "auto")



authenticate <- OAuthFactory$new(consumerKey = consumer_key,
                                 consumerSecret = consumer_secret,
                                 requestURL="https://api.twitter.com/oauth/request_token",
                                 accessURL="https://api.twitter.com/oauth/access_token",
                                 authURL="https://api.twitter.com/oauth/authorize")


setup_twitter_oauth(consumer_key, consumer_secret, access_token, token_secret)



save(authenticate, file="twitter authentication.Rdata")
[1] "Using direct authentication"

Web Scraping

Let's search what has been tweeted about the two front runners from the democratic and republican parties. We will analyze who has been the center of the tweets and the semtiments of the tweets.

In [3]:
Hillary <- searchTwitter("hillary + clinton", n=1000, lang='en', since='2015-10-01', until='2016-01-01')
Donald <- searchTwitter("donald + trump", n=1000, lang='en', since='2015-10-01', until='2016-01-01')
Bernie <- searchTwitter("bernie + sanders", n=1000, lang='en', since='2015-10-01', until='2016-01-01')
Ted <- searchTwitter("ted + cruz", n=1000, lang='en', since='2015-10-01', until='2016-01-01')
In [4]:
#Let's get text
hillary_txt <- sapply(Hillary, function(x) x$getText())
donald_txt <- sapply(Donald, function(x) x$getText())
bernie_txt <- sapply(Bernie, function(x) x$getText())
ted_txt <- sapply(Ted, function(x) x$getText())
In [5]:
#Number of tweets
NumTweets <- c(length(hillary_txt), length(donald_txt), length(bernie_txt), length(ted_txt))

Let's combine the tweets for each candidate.

In [6]:
tweets <- c(hillary_txt, donald_txt, bernie_txt, ted_txt)

#remove the non-alpha-numeric characters
tweets <- sapply(tweets,function(x) iconv(x, "latin1", "ASCII", sub=""))
In [7]:
head(tweets)
Out[7]:
Donald Trump Declares 'War' on Hillary Clinton, Jeb Bush – ABC News https://t.co/88i3Y14kbE
"Donald Trump Declares 'War' on Hillary Clinton, Jeb Bush ABCNews https://t.co/88i3Y14kbE"
RT @KurtSchlichter: Why is Democrat Bill Cosby OK to prosecute, but Democrat Bill Clinton and his enabler Democrat Hillary Clinton are not?
"RT @KurtSchlichter: Why is Democrat Bill Cosby OK to prosecute, but Democrat Bill Clinton and his enabler Democrat Hillary Clinton are not?"
RT @mashable: 3 things we learned from the latest Hillary Clinton email dump https://t.co/nPgt7NSmd7
"RT @mashable: 3 things we learned from the latest Hillary Clinton email dump https://t.co/nPgt7NSmd7"
Retweeted Steve Me (@SteveJurevicius): Firms Paid Bill Clinton Millions As They Lobbied Hillary Clinton... https://t.co/STnICPRuWk
"Retweeted Steve Me (@SteveJurevicius): Firms Paid Bill Clinton Millions As They Lobbied Hillary Clinton... https://t.co/STnICPRuWk"
Retweeted Steve Me (@SteveJurevicius): Firms Paid Bill Clinton Millions As They Lobbied Hillary Clinton... https://t.co/XS9UhP0N2T
"Retweeted Steve Me (@SteveJurevicius): Firms Paid Bill Clinton Millions As They Lobbied Hillary Clinton... https://t.co/XS9UhP0N2T"
RT @gerfingerpoken: Hillary Clinton waffles on #Keystone XL her State Dept OKd - American Thinker https://t.co/6LX35sOTDb - https://t.co/6…
"RT @gerfingerpoken: Hillary Clinton waffles on #Keystone XL her State Dept OKd - American Thinker https://t.co/6LX35sOTDb - https://t.co/6"

Sentiment Analysis

Let's apply the lexicon based sentiment analysis approach which was first proposed in (Hu and Liu, KDD-2004). More description about the method can be found here.

In [8]:
pos <- readLines("positive-words.txt")
neg <- readLines("negative-words.txt")
In [9]:
score.sentiment <- function(sentences, pos.words, neg.words, .progress='none'){

scores <- laply(sentences,
function(sentence, pos.words, neg.words){

    #remove punctuation - using global substitute
sentence <- gsub("[[:punct:]]", "", sentence)

    #remove control characters
sentence <- gsub("[[:cntrl:]]", "", sentence)

    #remove digits
sentence <- gsub('\\d+', '', sentence)

    #define error handling function when trying tolower
tryTolower <- function(x){

    #create missing value
y <- NA

    #tryCatch error
try_error <- tryCatch(tolower(x), error=function(e) e)

    #if not an error
if (!inherits(try_error, "error"))
y <- tolower(x)

    #result
return(y)
}

    #use tryTolower with sapply
sentence <- sapply(sentence, tryTolower)

    #split sentence into words with str_split (stringr package)
word.list <- str_split(sentence, "\\s+")
#word.list <- str_replace_all(word.list,"[^[:graph:]]", " ")
words <- unlist(word.list)


    #remove non-alphabetic characters
#alpha_num <- grep(words, pattern = "[a-z|0-9]", ignore.case = T)
#words <- paste(words[alpha_num], collapse = " ")
#words <- sapply(words,function(row) iconv(row, "latin1", "ASCII", sub=""))

    #compare words to the dictionaries of positive & negative terms
pos.matches <- match(words, pos.words)
neg.matches <- match(words, neg.words)

    #get the position of the matched term or NA
    #we just want a TRUE/FALSE
pos.matches <- !is.na(pos.matches)
neg.matches <- !is.na(neg.matches)

    #final score
score <- sum(pos.matches) - sum(neg.matches)
return(score)
}, pos.words, neg.words, .progress=.progress )

    #data frame with scores for each sentence
scores.df <- data.frame(text=sentences, score=scores)
return(scores.df)
}
In [10]:
#apply function score.sentiment
scores <- score.sentiment(tweets, pos, neg, .progress='text')
  |======================================================================| 100%
In [11]:
#add variables to a data frame
scores$Candidate = factor(rep(c("Hillary", "Donald", "Bernie", "Ted"), NumTweets))
scores$very.pos = as.numeric(scores$score >= 2)
scores$very.neg = as.numeric(scores$score <= -2)
In [12]:
#how many very positives and very negatives
numpos <- sum(scores$very.pos)
numneg <- sum(scores$very.neg)
#global score
global_score = paste0(round(100 * numpos / (numpos + numneg)),"%")
global_score
Out[12]:
"84%"

Now, let's compare which candidate has positive tweets and which has negative posts.

In [13]:
boxplot(score~Candidate, data=scores, col='blue')

Unlike his recent controvesial remarks, presidential candidate Donald Trump has the most positive comments while the sentiment towards Hillary Clinton is negative. The following histogram also convey the same message.

In [14]:
library("lattice")
histogram(data=scores, ~score|Candidate, main="Sentiment Analysis of the four presidential candidates", xlab="", sub="Sentiment Score")

Conclusion

Generally speaking, the tweets conveyed positive sentiment with total score of 84%.

Looking at the individual presidential candidates, the sentiment towards the republican front runners and democratic candidate Bernie Sanders was positive while that of Hillary Clinton was negative. According to the sentiment analysis, tweets involving Donald Trump has the most positive sentiment.

Predicting the Sentiment of Tweets

Now that we have the tweets, let's predict their sentiments. The objective is to classify to tweets as Positive, Neutral, or Negative.

In [ ]:
# Load the required packages
install.packages("SnowballC",repos='http://cran.us.r-project.org')
install.packages("rpart.plot", repos='http://cran.us.r-project.org')
install.packages("ROCR", repos='http://cran.us.r-project.org')
install.packages('randomForest', repos='http://cran.us.r-project.org')
In [15]:
library(SnowballC)
library(rpart.plot)
library(ROCR)
library(randomForest)
Loading required package: rpart
Loading required package: gplots

Attaching package: 'gplots'

The following object is masked from 'package:wordcloud':

    textplot

The following object is masked from 'package:stats':

    lowess

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

We are going to use the tweets containing the four candidates. We are going to use some of the above steps to clean the data.

In [16]:
tweetCorpus <- Corpus(VectorSource(tweets))

#remove punctuation marks
tweetsCorpus <- tm_map(tweetCorpus, removePunctuation)

#convert text to lower case
tweetsCorpus <- tm_map(tweetsCorpus, tolower)

#remove stopwords
tweetsCorpus <- tm_map(tweetsCorpus, removeWords, c("bernie", "sanders", "trump", "donald", "cruz", "ted", "hillary", "clinton", stopwords("english")))

#remove white spaces
tweetsCorpus <- tm_map(tweetsCorpus, stripWhitespace)

#transform to text which wordcloud can use
tweet_dtm <- tm_map(tweetsCorpus, PlainTextDocument)
In [17]:
#making a document-term matrix - this create a matrix with a row representing each tweets and a column
terms <- DocumentTermMatrix(tweet_dtm)
terms
Out[17]:
<<DocumentTermMatrix (documents: 4000, terms: 6831)>>
Non-/sparse entries: 34351/27289649
Sparsity           : 100%
Maximal term length: 66
Weighting          : term frequency (tf)

Sparsity indicates the number of common words in the tweets. Higher sparsity means the correlation among the tweets in low (there are many zeros in the text document matrix).

Because the number of terms indicates the number of columns in our document. Let's see the most common words and remove the less frequent words

In [18]:
length(findFreqTerms(terms, lowfreq=30)) # this find the words that appears at least 30 times
Out[18]:
186
In [19]:
#let's remove the sparse terms
sparseTerms <- removeSparseTerms(terms, 0.995)
sparseTerms
Out[19]:
<<DocumentTermMatrix (documents: 4000, terms: 299)>>
Non-/sparse entries: 17617/1178383
Sparsity           : 99%
Maximal term length: 18
Weighting          : term frequency (tf)

This has reduce the number of terms from 6826 to just 299 terms. Now, let's convert the sparse matrix into a data frame.

In [20]:
dataframe <- as.data.frame(as.matrix(sparseTerms))

Since some of the words in the tweets may start with a number, let's make sure the column names are in the proper format.

In [21]:
colnames(dataframe) <- make.names(colnames(dataframe))

Now, let's get the sentiment of each tweets from the scores.sentiment function.

In [22]:
dataframe$Negative <- as.factor(scores$score <=-1)
In [23]:
dataframe$score <- NULL
dataframe$Score <-NULL

The Predictive Models

We will use CART and logistic regression to predict negative sentiment.

Let's split the data into training and testing datasets.

In [24]:
set.seed(1000)
library(caTools)
In [25]:
split <- sample.split(dataframe$Negative, SplitRatio=0.7)
trainData <- subset(dataframe, split==TRUE)
testData <- subset(dataframe, split==FALSE)
In [26]:
modelCART <- rpart(Negative ~., data=trainData, method="class")
prp(modelCART)

From the model, tweets containing the words: dump, falls, hate, lair, attack, conservative, and politica convey negative sentiments.

Now, let's make prediction on the test dataset.

In [27]:
#make prediction
predictCART <- predict(modelCART, newdata = testData, type="class")
table(testData$Negative, predictCART)
Out[27]:
       predictCART
        FALSE TRUE
  FALSE   892   13
  TRUE    112  183
In [28]:
#Accurary
Accuracy <- (892+183)/sum(table(testData$Negative, predictCART))
round(Accuracy,3)
Out[28]:
0.896

Let's plot the ROC curve.

In [29]:
Prediction_ROC <- predict(modelCART, newdata = testData)
pred <- prediction(Prediction_ROC[,2], testData$Negative)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)

The area under the curve is give by:

In [30]:
performance(pred, "auc")@y.values
Out[30]:
  1. 0.80543496582077

Let's compare the CART model with a random forest classification model.

First let's load the required library for the random forest model.

In [31]:
#Random forest model
modelForest <- randomForest(Negative ~ ., data = trainData, nodesize = 25, ntrees = 200)
In [32]:
predictForest <- predict(modelForest, newdata = testData)
table(testData$Negative, predictForest)
Out[32]:
       predictForest
        FALSE TRUE
  FALSE   889   16
  TRUE    105  190
In [34]:
#Accurary
Accuracy <- (889+190)/sum(table(testData$Negative, predictForest))
round(Accuracy,3)
Out[34]:
0.899

Conclusion

    Presidential candidate Donald Trump has the most positive comments while the sentiment towards Hillary Clinton is negative. Tweets containing the words: dump, falls, hate, lair, attack, conservative, and politica convey negative sentiments. The performances of the CART ajd random forest classification models are almost similar. Both models are reasonably good in predicting negative tweets.