
Mining PROD Comments: Linked Data, Text Mining and R. Some Notes.

posted 31 Jan 2012, 02:00 by David Sherlock   [ updated 2 Feb 2012, 06:32 ]
I am not claiming to be a pro with anything mentioned in the title. I'm pretty sure I'll have messed up somewhere and made the results squiffy. Still, we all have to start somewhere. PROD is a directory of JISC projects that I maintain at a technical level (but I don't feed the beast its data). Someday I'd like to see PROD's transcendence: it would join the linked data organism in godhood, its physical manifestation as a web interface would die, and we'd get a 1999-style endgame video. Let's just start by getting some data into a Google spreadsheet for now.

1. Get a spreadsheet of all PROD comments and their authors. This isn't hard to do.

The SPARQL query is:

PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
PREFIX prod:  <http://prod.cetis.ac.uk/vocab/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT *
WHERE {    
 ?subject prod:comment ?comment . 
 ?subject dc:author ?author .
}

I used a proxy and a Google Spreadsheet (why doesn't Google Refine work for this?), pulling the results in with the get-by-URL style function.
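
If the spreadsheet proxy step is a bother, the same results can probably be pulled straight into R. This is only a sketch: the endpoint address and the output=csv parameter are my guesses, not anything from PROD's documentation.

# Sketch only: the endpoint URL and the CSV output parameter are assumptions.
endpoint <- "http://prod.cetis.ac.uk/sparql"
query <- "PREFIX prod: <http://prod.cetis.ac.uk/vocab/> PREFIX dc: <http://purl.org/dc/elements/1.1/> SELECT * WHERE { ?subject prod:comment ?comment . ?subject dc:author ?author . }"
url <- paste0(endpoint, "?query=", URLencode(query, reserved = TRUE), "&output=csv")
prod_csv <- read.csv(url, stringsAsFactors = FALSE)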

2. Get this into R.

Open RStudio, create a corpus, lowercase it and strip the punctuation. Natch. I wrote something that looks like this:

library(tm)  # load the tm text-mining package

prod_csv <- read.csv("/Users/David/Desktop/PHd/ds10-Text-Mining-Weak-Signals-8ffa02a/Source Data/comments.csv")
mydata.corpus <- Corpus(VectorSource(prod_csv$Comment))    # one document per comment
mydata.corpus <- tm_map(mydata.corpus, tolower)            # lowercase everything
mydata.corpus <- tm_map(mydata.corpus, removePunctuation)  # strip punctuation

The magic happens here:

mydata.dtm <- TermDocumentMatrix(mydata.corpus)            # terms as rows, comments as columns
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.95)  # drop terms absent from more than 95% of comments
mydata.df <- as.data.frame(inspect(mydata.dtm2))           # turn the matrix into a data frame
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean")           # distance matrix
fit <- hclust(d, method="ward")                            # hierarchical (Ward) clustering
plot(fit)                                                  # display the dendrogram


You get this:

[dendrogram of the clustered terms from the PROD comments]

What you are looking at here is a dendrogram. Terms that are higher up are more popular, and terms that are closer together are linked together. At the moment I think we are mostly seeing the words of popular sentences. Lots to do around resources. So my next step was to run this through again, pulling out words like 'and', which is done by removing 'stopwords'. I grabbed a list from here:

http://www.lextek.com/manuals/onix/stopwords1.html

Pretty sure there is an easy way to do this, but I added:

my_stopwords <- c(stopwords('english'), 'a', 'about', 'above', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also','although','always','am','among', 'amongst', 'amoungst', 'amount',  'an', 'and', 'another', 'any','anyhow','anyone','anything','anyway', 'anywhere', 'are', 'around', 'as',  'at', 'back','be','became', 'because','become','becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom','but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven','else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own','part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thickv', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves', 'the')
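
(In hindsight the easy way is probably just tm's built-in list, which is already the first element of the vector above; a minimal sketch:)

# Probably enough on its own: tm ships an English stopword list.
my_stopwords <- stopwords('english')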

then:

mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)  # strip the stopwords
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.95)
mydata.df <- as.data.frame(inspect(mydata.dtm2))
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display the dendrogram
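
To pull concrete groups out of the tree rather than just eyeballing the plot, cutree() and rect.hclust() should do it; the choice of five clusters below is my assumption, not something that came out of the data.

groups <- cutree(fit, k=5)             # assign each term to one of 5 clusters (k is an assumption)
rect.hclust(fit, k=5, border="red")    # draw those clusters onto the dendrogram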



My next step was to do this for a single programme, but the triplestore seems to have died.

Tuesday's library time should be used to get into some of the clustering approaches and understand them properly, that is: hierarchical agglomerative, partitioning, and model-based.
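
As a placeholder for the partitioning approach, k-means on the same scaled matrix is probably where I'd start; the number of centres below is an assumption.

fit.km <- kmeans(mydata.df.scale, centers=5)  # partitioning approach; 5 centres is a guess
fit.km$cluster                                # which cluster each term landed in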

Edit:

I did this for Curriculum Design comments:

[dendrogram for the Curriculum Design comments]

Used Resources:

http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/
http://www.statmethods.net/advstats/cluster.html 