The results from the Data Innovation Survey are in – and offer a good opportunity to play with a nice data set. What follows is a very preliminary analysis of the data.

As you’ll note from the above graph, time was very short, both for the people filling in the survey and for the people trying to do an analysis before the big event of the Data Innovation Summit 2015.

The most interesting parts of the survey were Question 11 and Question 13. The idea with these questions was to create a picture of what kind of data scientists are active in Belgium, and what they perceived to be their most important skills. Both questions resulted in a multivariate data set. Only the first of these, the replies to Question 11, is analysed here. The results of the other questions, including Question 13, will be analysed in time for the Finals of the competition.

Question 11 asked survey participants to score their skills on a scale from 1 to 5, where 1 was ‘low’, meaning no familiarity with the skill, and 5 meant that the respondent considered himself an expert. The list of 22 skills was taken from ‘Analyzing the Analyzers’, a publication by Harlan Harris, Sean Murphy and Marck Vaisman, available for free from O’Reilly.

In order to compensate for differential scoring of different respondents, the raw scores were replaced by the ranks each of the skills was given by that respondent. This has the net effect of standardising the data, with a constant sum for each row/respondent, corresponding to the sum of the ranks of the 22 skills.
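
As a rough illustration of this rank transform, here is a minimal Python sketch. It assumes tied scores receive the average of the ranks they span; the post does not say how ties were handled, but with that convention the ranks of 22 skills always sum to 22 × 23 / 2 = 253. The function name and the example respondent are made up for illustration.

```python
def average_ranks(scores):
    """Return 1-based ranks of `scores`; tied values share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        # Find the end of the run of tied values that begins at `start`.
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        # Sorted positions start..end share the mean of ranks start+1..end+1.
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


# Hypothetical respondent scoring five skills on the 1-5 scale (not survey data).
raw = [3, 5, 3, 1, 4]
ranked = average_ranks(raw)
print(ranked)       # [2.5, 5.0, 2.5, 1.0, 4.0]
print(sum(ranked))  # 15.0, i.e. 5 * 6 / 2 regardless of the raw scores
```

A respondent who gives every skill the same raw score ends up with identical ranks for all skills, so only the relative ordering within a respondent survives the transform.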

In the figure above, a box-and-whisker plot is shown for the standardised scores on each of the 22 skills. The heavy line in the centre of the box corresponds to the median; the box itself is drawn between the lower and upper quartiles. The whiskers extend outwards from the quartiles, for a maximal distance of 1.5 times the interquartile range.
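
The numbers behind one such box can be sketched as follows. This is only an approximation of what a plotting package does: quartile definitions differ between tools, and the sketch assumes the common convention that each whisker stops at the most extreme observation still within 1.5 times the interquartile range of the box. The function name and the sample values are hypothetical.

```python
import statistics


def box_stats(values):
    """Summary statistics for a single box-and-whisker glyph."""
    # The "inclusive" method treats the data as the full population of interest.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers end at the furthest data points that are inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Anything beyond the fences would be drawn as an individual point.
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Made-up values for one skill, not survey data.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(stats["q1"], stats["median"], stats["q3"])
print(stats["whisker_low"], stats["whisker_high"], stats["outliers"])
```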

The box-and-whisker plot clearly demonstrates what respondents thought were their most important skills. The same information is summarised in the bar plot above, in which the skills are ordered according to the means of their standardised scores; the means were rescaled so that the theoretical maximum is 1.

The skills are obviously not unrelated, as they reflect the interests and the talents of the respondent. In the dendrogram displayed above, the relationships between these skills are shown, based on how they co-vary within respondents. Not surprisingly, ‘Business Skills’ and ‘Product development’ appear close together, as do ‘Big Data’ and ‘NoSQL’.
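
To give an idea of how a dendrogram is built, the sketch below performs agglomerative clustering on a small dissimilarity matrix. The post does not state which dissimilarity or linkage was used; the sketch assumes single linkage, and the dissimilarities are invented numbers (think of them as one minus a correlation), not values from the survey.

```python
def single_linkage(labels, dist):
    """Return the merges as (cluster_a, cluster_b, height), in merge order."""
    index = {label: i for i, label in enumerate(labels)}
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(dist[index[x]][index[y]]
                        for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges


# Invented dissimilarities between four skills, for illustration only.
skills = ["Business Skills", "Product development", "Big Data", "NoSQL"]
dissimilarity = [
    [0.00, 0.30, 0.90, 1.00],
    [0.30, 0.00, 1.10, 0.95],
    [0.90, 1.10, 0.00, 0.20],
    [1.00, 0.95, 0.20, 0.00],
]
for left, right, height in single_linkage(skills, dissimilarity):
    print(left, "+", right, "at height", height)
```

Each merge corresponds to one junction in the dendrogram, drawn at the height at which the two branches are joined.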

Correlations between standardised skill scores are illustrated in this heat map. Rows and columns correspond to specific skills; since the set of skills is the same along rows and columns, the diagonal shows up as dark blue, corresponding to a correlation of 1. Lighter blue off-diagonal cells correspond to pairs of skills that tend to be co-represented in individuals, such as the examples mentioned above (business skills and product development; Big Data and NoSQL). Brown colours correspond to pairs of skills that are rarely present in a single respondent (such as Product development and Machine Learning; Statistics and Back-end Programming).
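
The matrix behind such a heat map can be computed along these lines. The sketch assumes Pearson correlations between skill columns; the post does not say which correlation coefficient was used, and the small table of ranks is made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(rows):
    """Rows are respondents, columns are skills; returns skill-by-skill correlations."""
    columns = list(zip(*rows))
    return [[pearson(a, b) for b in columns] for a in columns]


# Four hypothetical respondents ranking three skills (not survey data).
rank_rows = [
    [1, 2, 3],
    [1, 3, 2],
    [3, 2, 1],
    [2, 1, 3],
]
for row in correlation_matrix(rank_rows):
    print([round(r, 2) for r in row])
```

The diagonal is 1 by construction and the matrix is symmetric, which is why the heat map mirrors itself across the diagonal.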

Another method of looking at the correlational structure of a data set is through a Principal Component Analysis. Part of the output of a PCA, a ‘biplot’, is shown above. In a PCA, the original data are reprojected, to move as much of the variation present in the data into as few dimensions as possible. These new dimensions are the ‘Principal Components’. In the figure above, a plot is presented of the first two Principal Components. Each of the dots represents a single respondent, positioned according to that respondent’s Principal Component scores. The red arrows illustrate the correlation of the original (standardised skill score) variables with the Principal Components (the PC loadings), and with each other. Here again we see that Business skills and Product Development vary together, and are ‘orthogonal’ to more technical skills such as SQL and system administration.
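
For readers who want to see the mechanics, here is a minimal PCA sketch using NumPy's singular value decomposition. It is not the code used for the figure: the data are randomly generated ranks, the function name is hypothetical, and statistical packages differ in how they scale the loadings that are drawn as arrows in a biplot.

```python
import numpy as np


def pca(data):
    """PCA of a respondents-by-skills array via SVD of the centred data."""
    centred = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u * s                        # one row of PC scores per respondent
    loadings = vt.T                       # column k is the direction of PC k+1
    variances = s ** 2 / (len(data) - 1)  # variance captured by each PC
    return scores, loadings, variances


# Random ranks for 30 hypothetical respondents and 6 skills (not survey data).
rng = np.random.default_rng(0)
ranks = np.argsort(rng.random((30, 6)), axis=1) + 1.0

scores, loadings, variances = pca(ranks)
first_two = scores[:, :2]  # the coordinates of the dots in a biplot
print(first_two.shape)
print(np.round(variances, 3))
```

Because every row of ranks adds up to the same total, one direction in the data carries no variation at all, so the last component has a variance of (numerically) zero.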

The first two Principal Components, as displayed in the biplot above, only capture part of the variance in the data. In a scree plot, the variance captured by each of the Principal Components is displayed. Clearly, and as intended by a PCA, the first few PC axes capture large fractions of the variance, but by no means all of it. It will be necessary to plot more than just the first two PC scores against each other, to get a better idea of the structure of the data.
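
The quantities shown in a scree plot can be sketched as below: the eigenvalues of the covariance matrix, expressed as fractions of the total variance. The data are again randomly generated ranks, and the 80% threshold is an arbitrary choice for illustration, not a figure from the analysis.

```python
import numpy as np


def explained_variance_ratio(data):
    """Fraction of total variance captured by each PC, largest first."""
    covariance = np.cov(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(covariance)[::-1]  # eigvalsh returns ascending order
    eigenvalues = np.clip(eigenvalues, 0.0, None)       # remove tiny negative round-off
    return eigenvalues / eigenvalues.sum()


# Random ranks for 30 hypothetical respondents and 6 skills (not survey data).
rng = np.random.default_rng(0)
ranks = np.argsort(rng.random((30, 6)), axis=1) + 1.0

ratio = explained_variance_ratio(ranks)
cumulative = np.cumsum(ratio)
# Number of components needed to reach an (arbitrary) 80% of the variance.
n_needed = int(np.searchsorted(cumulative, 0.8)) + 1
print(np.round(ratio, 3))
print(n_needed)
```

Plotting `ratio` against the component number gives the scree plot; `cumulative` shows how many axes have to be inspected before most of the structure is covered.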
