Data Innovation Survey results – In Neo4j

(Reblogged from a post originally published over here.)

Today, I had loads of fun at the Data Innovation Summit in Brussels, Belgium. Hosted in the beautiful Axa Belgium offices, it was a great opportunity to meet 500 (!!) data-minded professionals. I was also able to do an Ignite Talk there, which was quite an experience: 15 seconds for every slide, and no way to advance the slides yourself and set the “rhythm”. Very different. Here are the slides:


But that was not the coolest thing. They also ran a “Data Innovation Survey”, which was super cool. The data is all open (find it in this gist), and of course I took it from Excel, created a graph model out of it, and then loaded it into Neo4j using this load script. You will need to tweak the file locations in the LOAD CSV statements, but after that: just download Neo4j 2.2, fire up the neo4j-shell, and paste all the commands into it. Loading the data should take about half a minute.
Then we have the data in Neo4j, and we can start doing some queries. Now, I must admit that I am not a huge fan of working the data this way, as there are very few intricate relationships that we can use meaningfully. Nevertheless, here are a few queries:
 //respondents and techniques with PhDs  
 MATCH (dl:DegreeLevel {name:"PhD"})--(r:Respondent)--(t:Technique)  
 return dl,r,t  
That’s easy:

Let’s make it a bit more sophisticated:

 //respondents and techniques at level 5 with PhDs and their DegreeFields  
 MATCH (dl:DegreeLevel {name:"PhD"})--(r:Respondent)-[ht:HAS_TECHNIQUE {level:'5'}]--(t:Technique),  
 (r)--(df:DegreeField)  
 return dl,r,t,df  
 limit 10  
You can see how that would make the visualisation a bit more complicated.
And then finally, here is a first attempt at doing something a bit more “graphy”. Let’s see which “DegreeFields” are the most important in our graph. In other words – the most “Between” the other nodes of the graph. We do that with a query like this:
 //betweenness centrality of the "DegreeFields"  
 MATCH p=allShortestPaths((r1:Respondent)-[*]-(r2:Respondent))  
  WHERE id(r1) < id(r2) and length(p) > 1  
  UNWIND nodes(p)[1..-1] as n  
  WITH n, count(*) as betweenness, labels(n) as labels  
  WHERE "DegreeField" in labels  
  RETURN n.name, betweenness  
  order by betweenness desc;  
and then we see this result:
There’s a lot of importance to Science/Mathematics, ICT and Engineering. Who would have thought?
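To make explicit what the betweenness query above is counting, here is a minimal Python sketch of the same idea on a tiny made-up graph. Everything in it is an assumption for illustration: the respondents `r1` to `r4`, their links to degree fields, and the helper names `build_adjacency`, `all_shortest_paths` and `betweenness` are hypothetical and do not come from the survey data or from Neo4j. It takes every pair of respondents once (like `id(r1) < id(r2)`), finds all shortest paths between them, and counts how often a degree field shows up in the interior of a path (like `nodes(p)[1..-1]`).

```python
from collections import Counter, deque
from itertools import combinations

# Hypothetical toy data: four respondents linked to two degree fields.
# None of these names or links come from the actual survey.
EDGES = [
    ("r1", "ICT"),
    ("r2", "ICT"),
    ("r3", "ICT"),
    ("r3", "Engineering"),
    ("r4", "Engineering"),
]
RESPONDENTS = ["r1", "r2", "r3", "r4"]
DEGREE_FIELDS = {"ICT", "Engineering"}


def build_adjacency(edges):
    """Build an undirected adjacency map (the Cypher pattern ignores direction)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def all_shortest_paths(adjacency, source, target):
    """Return every shortest path from source to target as a list of nodes."""
    # Breadth-first search that remembers all predecessors lying on a shortest path.
    dist = {source: 0}
    preds = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                preds[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                preds[nxt].append(node)
    if target not in dist:
        return []
    # Walk back from the target along the recorded predecessors.
    paths = []
    stack = [[target]]
    while stack:
        partial = stack.pop()
        head = partial[-1]
        if head == source:
            paths.append(partial[::-1])
        else:
            for pred in preds[head]:
                stack.append(partial + [pred])
    return paths


def betweenness(edges, respondents, wanted):
    """Count how often each wanted node is an interior node of a shortest path."""
    adjacency = build_adjacency(edges)
    counts = Counter()
    for a, b in combinations(respondents, 2):  # each unordered pair once
        for path in all_shortest_paths(adjacency, a, b):
            for node in path[1:-1]:  # drop the two respondents at the ends
                if node in wanted:
                    counts[node] += 1
    return counts


print(betweenness(EDGES, RESPONDENTS, DEGREE_FIELDS).most_common())
# prints [('ICT', 5), ('Engineering', 3)]
```

Note that this is a raw path count, just like in the Cypher query, and not the normalised betweenness centrality you would find in a graph theory textbook. For ranking the degree fields against each other, the raw count is good enough.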
You can of course apply these techniques much more generically to other problems, and that is mostly why I share it here. I hope others find it interesting, and as always…
… Feedback welcome!
Cheers
Rik
