One Mode Data


1. Turning your CSV into a One Mode data Matrix and Graph

Disclaimer: This is basically a collection of notes I wrote for myself, for when I have forgotten everything I have read on the subject in six months' time. Take with a pinch of salt. This barely scratches the surface.


Getting into Social Network Analysis (SNA) is difficult: there seems to be so much to know about the area but so little practical advice. It helps to know that it is a strand of Complex Networks that looks at interactions between nodes in a social network. While online social networks might be a relatively new concept, Complex Networks are not, and it helped me to get into SNA by first getting my head around Complex Network concepts. I highly recommend chapter 13 of Statistical Pattern Recognition.

Graph?



Wikipedia describes a graph as 'an abstract representation of a set of objects where some pairs of the objects are connected by links.'

Grab some networked data

You’ll need social network data if you are to do any kind of social network analysis. In Complex Networks a network can be represented mathematically as a graph G = {V, E}, where V is the set of nodes (vertices) and E is the set of edges. That is to say, your graph will have a bunch of nodes, and the connections between these nodes are called edges. If a node is connected to another node, then the connected node is called a neighbour.
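
To make the node, edge and neighbour vocabulary concrete, here is a minimal sketch in plain R. The node names, the edge list and the neighbours() helper are all made up for illustration; they are not part of any library.

```r
# Nodes (V) and edges (E) of a small undirected graph; names are hypothetical
V <- c("A", "B", "C", "D")
E <- data.frame(from = c("A", "A", "B"),
                to   = c("B", "C", "C"),
                stringsAsFactors = FALSE)

# Neighbours of a node: every node that shares an edge with it
# (hypothetical helper, checks both ends because the edges are undirected)
neighbours <- function(node, edges) {
  sort(unique(c(edges$to[edges$from == node],
                edges$from[edges$to == node])))
}

neighbours("A", E)  # "B" "C"
neighbours("D", E)  # character(0): D has no edges
```

A node with no edges, like D here, is still part of V; it just has no neighbours.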

At present I am more interested in getting my head around SNA and working out a 'data mining->preparation->evaluation' process than exploring the data itself.

The data I am going to play with is a list of organisations coming to JISC CETIS events from 2007 – 2012. This data isn’t perfect but it’s something I already happen to have. There are some drawbacks: recently CETIS has started to use an external events system, and its delegates are not included. Many delegates put different names for the same organisation (e.g. MMU or Manchester Met), although I have done my best to tidy this up in Google Refine. Still, at present I am more interested in establishing a process to use with future datasets.







Currently my dataset is in CSV format. I quite often have data that I've grabbed from MySQL, a linked data store or Excel and tidied up in Google Refine. At the moment my data looks something like this:


Example of data in Google Refine

Now I have a problem: my data is a table, and in my head I can see something that I think represents nodes and something that represents the edges between them. My computer has no idea of the relationships in my data and I need a way of telling it. I want to take my CSV and turn it into a graph I can explore, something like this:

Why is nothing ever this simple

So my problem is turning my spreadsheet into a data format that can be plotted as a graph. For one-mode undirected data, one solution is to turn it into an adjacency matrix; tools such as R can import my dataset and create the matrix for me. I’m still new to R, but it seems like it might be a powerful way to turn my CSV dataset into networked data. I use RStudio as my R weapon of choice, so I booted it up and borrowed some examples from the Stanford comm department to write some code that opened my CSV, read it as a table and turned it into an adjacency matrix, which was then exported as CSV.

# Read the attendance list (one row per organisation per event)
df <- read.csv("/Users/David/Desktop/PhD/R_github/ROI/data/Ins_Event.csv", header = TRUE, sep = ",")

# Cross-tabulate the two columns into a two-mode incidence matrix
M <- as.matrix(table(df))

# Project onto the rows: a row-by-row matrix weighted by shared columns
Mrow <- M %*% t(M)

# Project onto the columns instead
# Mcol <- t(M) %*% M

write.csv(Mrow, "test.csv")
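
To see what the matrix multiplication above does, here is the same pipeline on a made-up attendance list. The organisations A, B, C and the events e1, e2 are hypothetical, and the sketch assumes each organisation appears at most once per event.

```r
# Hypothetical attendance list: one row per organisation per event
toy <- data.frame(org   = c("A", "A", "B", "B", "C"),
                  event = c("e1", "e2", "e1", "e2", "e2"))

# Two-mode incidence matrix: organisations in rows, events in columns
M <- as.matrix(table(toy))

# One-mode projection: cell [i, j] counts the events that i and j both attended
Mrow <- M %*% t(M)
Mrow
```

A and B both went to e1 and e2, so their cell is 2; C only shares e2 with each of them, so those cells are 1. The diagonal holds how many events each organisation attended, which is why self-loops need dropping later on.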

The adjacency matrix tells the computer whether there is an edge between two nodes. This particular matrix is weighted, so each cell holds a count rather than a simple 0 or 1, and looks something like this:


This is a one-mode dataset and is easy to turn into a graph. As Steve Borgatti describes it: "2-Mode Matrix. A (2-dimensional) matrix is said to be 2-mode if the rows and columns index different sets of entities (e.g., the rows might correspond to persons while the columns correspond to organizations). In contrast, a matrix is 1-mode if the rows and columns refer to the same set of entities, such as a city-by-city matrix of distances."

R can then create a graph out of this:

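library(igraph)

# Build an undirected multigraph: each cell value becomes that many parallel edges
iMrow = graph.adjacency(Mrow, mode = "undirected")

# Record how many parallel edges each pair has as the edge weight
E(iMrow)$weight <- count.multiple(iMrow)

# Collapse parallel edges and drop self-loops; keep the first weight rather than summing duplicates
iMrow <- simplify(iMrow, edge.attr.comb = list(weight = "first"))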
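
For what it's worth, newer versions of igraph appear to offer a shorter route to the same weighted graph. The sketch below is an assumption about how that would look, not what the Stanford example does: graph_from_adjacency_matrix() is the newer name for graph.adjacency(), weighted = TRUE reads each cell value in as the edge weight, and diag = FALSE ignores the diagonal so no self-loops are created. The matrix Mtoy is made up for illustration.

```r
library(igraph)

# Toy organisation-by-organisation matrix (hypothetical values)
Mtoy <- matrix(c(2, 1, 0,
                 1, 3, 2,
                 0, 2, 2),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Cell values become edge weights directly; zero cells produce no edge
g <- graph_from_adjacency_matrix(Mtoy, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)

V(g)$name    # vertex names are taken from the matrix dimnames
E(g)$weight  # one weight per remaining edge
```

This skips the multigraph, count and simplify steps, at the cost of hiding what is going on underneath.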

You can do all sorts of interesting things in R, but I think it's a dangerous place to play with new concepts that you are not familiar with. I tried the Stanford example to spruce it up, but it didn't come out great and I am not familiar enough with the domain to improve it on my own:

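iMrow = graph.adjacency(Mrow, mode = "undirected")

E(iMrow)$weight <- count.multiple(iMrow)

# Keep the first weight of each merged edge rather than summing duplicates
iMrow <- simplify(iMrow, edge.attr.comb = list(weight = "first"))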


# Set vertex attributes

V(iMrow)$label = V(iMrow)$name

V(iMrow)$label.color = rgb(0,0,.2,.8)

V(iMrow)$label.cex = .6

V(iMrow)$size = 6

V(iMrow)$frame.color = NA

V(iMrow)$color = rgb(0,0,1,.5)

 

# Set edge transparency (alpha) according to edge weight

egam = (log(E(iMrow)$weight)+.3)/max(log(E(iMrow)$weight)+.3)

E(iMrow)$color = rgb(.5,.5,0,egam)

 

# Write both layouts to one PDF, one page per plot
pdf("iMrow.pdf")

plot(iMrow, main = "layout.kamada.kawai", layout=layout.kamada.kawai)

plot(iMrow, main = "layout.fruchterman.reingold", layout=layout.fruchterman.reingold)

dev.off()

This produces something that doesn't mean much to anybody.
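
One step in the script above that is worth unpacking is the egam line, which turns edge weights into transparency values for rgb(). Here is a small worked sketch in plain R, using made-up weights rather than the CETIS data:

```r
# Hypothetical edge weights (shared-event counts)
w <- c(1, 2, 5, 20)

# log() compresses large weights; the 0.3 offset stops weight-1 edges
# (where log(1) = 0) from becoming fully transparent; dividing by the
# maximum scales everything into the range (0, 1]
egam <- (log(w) + 0.3) / max(log(w) + 0.3)
round(egam, 2)

# rgb() takes the scaled value as its fourth (alpha) argument
edge_cols <- rgb(0.5, 0.5, 0, egam)
edge_cols
```

The heaviest edge always ends up fully opaque, and everything else fades relative to it, so a single very strong tie can make the rest of the graph look washed out.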

I thought it would help to leave R here with my graph and take it to a more user-friendly tool. It's quite easy to export the graph in a format that most graph tools understand. I wanted to start with Gephi, which is less of a complex networks tool and more of a visualisation package. To export our work so far as GraphML:

write.graph(iMrow, file = "graph.graphml", format = "graphml")
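
If igraph isn't to hand, another option is a plain edge list. As far as I know, Gephi's spreadsheet import accepts a CSV with Source, Target, Weight and Type columns, so the adjacency matrix can be flattened in base R. This is a sketch on a made-up matrix (Mtoy is hypothetical), writing to a temporary folder:

```r
# Toy weighted adjacency matrix (hypothetical values)
Mtoy <- matrix(c(2, 1, 0,
                 1, 3, 2,
                 0, 2, 2),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Keep each undirected pair once (upper triangle) and drop zero cells;
# the diagonal is excluded, so no self-loops are written
idx <- which(upper.tri(Mtoy) & Mtoy > 0, arr.ind = TRUE)

edges <- data.frame(Source = rownames(Mtoy)[idx[, "row"]],
                    Target = colnames(Mtoy)[idx[, "col"]],
                    Weight = Mtoy[idx],
                    Type   = "Undirected",
                    stringsAsFactors = FALSE)

out_file <- file.path(tempdir(), "edges.csv")
write.csv(edges, out_file, row.names = FALSE)
edges
```

The edge list is also easy to eyeball in a spreadsheet, which is handy for checking that the weights are what you expect before plotting anything.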

Gephi is an easy tool to use, and playing with the data in a visual manner is a fun thing to do. I won’t cover how to use Gephi as there are lots of tutorials already out there; this post by Tony Hirst is a very good one, and I had something more readable in 10 minutes.



The graph in Gephi: with labels (left) and without labels (right)

You would need to spend more time making the images look good, readable and so on. I'm of the opinion that as soon as you export these graphs as a PNG and lose all the cool features of your SNA tool, the data is not as useful, although a visualisation can be all you need to pull someone into a conversation!

Anyway, my goal of CSV->Graph was achieved. Hurrah.

 
