
R: Sentiment comparison

posted 31 Jan 2012, 06:54 by David Sherlock   [ updated 2 Feb 2012, 06:35 ]
Listening is important. I tend to find Twitter visualisations that concentrate on the data in tweets much more relevant than the ones that simply show links. Following on from yesterday's R marathon, I thought I might compare sentiment word lists against our PROD comments. I gave up halfway through, but have documented the process anyway:

First I loaded the positive and negative word lists:

hu.liu.pos = scan('/Users/David/Desktop/PHd/sentiment/positive-words.txt', what='character', comment.char=';')
hu.liu.neg = scan('/Users/David/Desktop/PHd/sentiment/negative-words.txt', what='character', comment.char=';')
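
As a quick check of what scan() does with these arguments, here is a minimal sketch using a throwaway file. The file contents and the name demo.words are made up for illustration, and it assumes the real lists mark their header lines with ';', which is what comment.char=';' suggests.

```r
# Hypothetical miniature word list; the real files are much longer.
tmp <- tempfile(fileext = ".txt")
writeLines(c("; example header line", ";", "", "good", "great", "excellent"), tmp)

# comment.char=';' drops everything from ';' to the end of the line,
# and blank lines are skipped by default, so only the words remain.
demo.words <- scan(tmp, what = "character", comment.char = ";", quiet = TRUE)
print(demo.words)

unlink(tmp)  # remove the temporary file
```

The result is a plain character vector, which is the shape match() needs later on.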

Then load the comments, grabbed from the PROD triple store:

prod_csv <- read.csv("/Users/David/Desktop/PHd/ds10-Text-Mining-Weak-Signals-8ffa02a/Source Data/comments.csv")
sample = prod_csv$Comment
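
One thing worth knowing here: in R versions before 4.0, read.csv() turns text columns into factors by default. The cleaning steps further down still work, since gsub() coerces its input to character, but keeping the comments as plain strings is easier to reason about. A small sketch with made-up rows (only the Comment column name comes from the real file; demo_csv and demo.sample are illustrative names):

```r
# Made-up rows standing in for comments.csv.
csv.text <- 'Comment
"Great upgrade, works well"
"Terrible documentation"'

# stringsAsFactors = FALSE keeps the column as character on any R version.
demo_csv <- read.csv(text = csv.text, stringsAsFactors = FALSE)
demo.sample <- demo_csv$Comment
str(demo.sample)
```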

Then set up the word lists used for scoring:

pos.words = c(hu.liu.pos)
neg.words = c(hu.liu.neg)

Don't forget you can add your own words here, like so:

pos.words = c(hu.liu.pos, 'upgrade', 'bestaroundnevergonnakeepmedown')

Define the score.sentiment function:

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
# laply comes from plyr, str_split from stringr
require(plyr)
require(stringr)

# we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
# we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
scores = laply(sentences, function(sentence, pos.words, neg.words) {
       # clean up sentences with R's regex-driven global substitute, gsub():
       sentence = gsub('[[:punct:]]', '', sentence)
       sentence = gsub('[[:cntrl:]]', '', sentence)
       sentence = gsub('\\d+', '', sentence)
       # and convert to lower case:
       sentence = tolower(sentence)
       # split into words. str_split is in the stringr package
       word.list = str_split(sentence, '\\s+')
       # sometimes a list() is one level of hierarchy too much
       words = unlist(word.list)
       # compare our words to the dictionaries of positive & negative terms
       pos.matches = match(words, pos.words)
       neg.matches = match(words, neg.words)
       # match() returns the position of the matched term or NA
       # we just want a TRUE/FALSE:
       pos.matches = !is.na(pos.matches)
       neg.matches = !is.na(neg.matches)
       # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
       score = sum(pos.matches) - sum(neg.matches)
       return(score)
}, pos.words, neg.words, .progress=.progress )
# pair each score with the text it was computed from
scores.df = data.frame(score=scores, text=sentences)
return(scores.df)
}
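
To see the matching step in isolation, here is a base-R sketch that scores a single sentence by hand. The dictionaries, the sentence and the demo.* names are made up for illustration, and strsplit() stands in for stringr's str_split() so the snippet needs no packages.

```r
# Toy dictionaries and sentence, invented for this example.
demo.pos <- c("good", "great", "upgrade")
demo.neg <- c("bad", "broken")
demo.sentence <- "Great upgrade, but the login page is broken!"

# Same cleaning steps as in the function: strip punctuation,
# control characters and digits, then lower-case.
clean <- gsub("[[:punct:]]", "", demo.sentence)
clean <- gsub("[[:cntrl:]]", "", clean)
clean <- gsub("\\d+", "", clean)
clean <- tolower(clean)

# Split on whitespace (base R stand-in for str_split).
demo.tokens <- unlist(strsplit(clean, "\\s+"))

# match() gives a position or NA; !is.na() turns that into TRUE/FALSE.
pos.hits <- !is.na(match(demo.tokens, demo.pos))
neg.hits <- !is.na(match(demo.tokens, demo.neg))

# Positive hits minus negative hits.
demo.score <- sum(pos.hits) - sum(neg.hits)
print(demo.score)
```

Here "great" and "upgrade" count as positive and "broken" as negative, so the sentence ends up mildly positive overall.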


Make sure the required packages (plyr and stringr) are installed, then run:

sample = prod_csv$Comment
result = score.sentiment(sample, pos.words, neg.words)

result now holds every comment alongside a score for how positive it is. I guess the next step would be to match these scores against different programmes or projects. I'm going to give up here, as I'm not sure this is the right data for this technique. Something to think about.
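
For what it's worth, matching scores against programmes could look roughly like the sketch below. It is only a sketch: the programme column is an assumption (I have not checked what comments.csv offers for grouping), and the scores are invented.

```r
# Hypothetical scored output; 'programme' is an assumed column name
# and the values are made up.
demo.result <- data.frame(
  score = c(2, -1, 0, 3),
  programme = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Average sentiment score per programme.
mean.by.programme <- aggregate(score ~ programme, data = demo.result, FUN = mean)
print(mean.by.programme)
```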