Leo De Bock delivers the speech of Minister Kris Peeters at the Data Innovation Summit

It is so nice to be supported by Minister Kris Peeters and his team during our first Data Innovation Summit. Thank you, Leo, for the excellent presentation.

  • Here is the video from Leo De Bock.
  • See all presentations from the summit here
  • See all the pictures from the event here

 

Ladies and gentlemen, 

First of all, I would like to apologise on behalf of Vice Prime Minister Kris Peeters, who is unable to attend because of a political meeting that eventually came to dominate his agenda.

On days like these, when the bright minds of the world of data come together, we have a great opportunity to look ahead. We can discuss the next step in a field that progresses at an impressive speed. 

Now that digital has become the new normal, what will be the new extraordinary?  

The digital economy is one of the most dynamic and promising sectors in terms of development potential. Its possibilities for growth far exceed those of other sectors. Mobile data traffic doubles each year; the use of the internet does so every two or three years. Today, 4 million people work in ICT in the European Union and their share increases by 3% per year, despite the economic crisis.

For the federal government of Belgium, our prime goal is to translate this digital growth into job creation. This is why we are developing our digital agenda. If we want to stay ahead as a digital nation, we’ll need to invest. We have a more developed high-speed internet infrastructure than most other countries. We do not want to give up that advantageous position. Instead, we wish to continue to invest in a 4G and LTE-Advanced network in Belgium.

At the same time, we need to invest in our regulatory framework. We need to update our privacy legislation that dates back to 1992, which, in digital terms, is the stone age. Privacy is a key driver for digital progress. The digital revolution will come to a halt when people’s trust diminishes. People have to be granted the right to information and the right to be overlooked, ignored, forgotten. Moreover, people need to feel safe when they go online.

The government therefore wants to make the CyberSecurityCenter operational this year. This center will work out a strategy to secure our nation’s digital network and the information we have online. Moreover, the Federal Public Service for Economy organizes campaigns to raise awareness of the data and digital fingerprint that people leave online.

In today’s age of big data, these kinds of campaigns have become a necessity. Never before in the history of mankind have data been collected, processed and linked on such a massive scale. A company’s power today is not just valued in terms of capital, but also in terms of data. Data is the new gold. Data means a company can produce, ship and market its products and services far more efficiently.

In fact, our cabinet has already reaped the benefits of big data. We have been working together with a promising Belgian start-up, called Dataminded, who analyse our social media activities and we have already changed our communication policy on the basis of this analysis. 

Both the Vice Prime Minister and his staff are convinced of the great added value of big data. But, ladies and gentlemen, this does not mean that all bets are off. As I mentioned earlier, we need some form of regulation. Moreover, we need to think about creating a level playing field. 

The term ‘big data’ is quite self-explanatory. Big companies, first and foremost, are capable of gathering a critical mass of data for useful analysis. Those companies have the means to buy data or to invest in data processing. The question, therefore, is how we can make sure that small and medium-sized companies, which are in a sense the backbone of our economy, can also become part of the big data story.

This is why the federal government is working on legislation that makes open data accessible to citizens, companies and researchers. The exchange of data between governments and other organisations will strengthen literally every citizen and company. Sharing data means strengthening everybody.

Companies that own data need to keep this in mind. Big data should be more than a new way to maximize profits. Big data should also benefit society. Take product information for instance. It is abundantly present in digital networks and it is used to reduce costs, boost productivity and make marketing efforts more efficient. Product information is often highly specialized, technical and exhaustive. It is so exhaustive that it confuses the average consumer. This is the point where big data should cross the boundary between economic logic and social logic.

Let me give you one very concrete example. In December of last year, a new European regulation on Food Information to Consumers came into force. This legislation will ensure that consumers get more information on the food products that are put on sale in stores. But given the abundance of information, it is difficult for the consumer to use this in a meaningful way. So wouldn’t companies rather share their product information with others who can present these data in a more comprehensible way? Isn’t it socially responsible, isn’t it a corporate social duty for them to share information so that this legislation can actually be applied and we can make consumers more aware and give them a chance to make rational decisions? This is where data should be turned into knowledge.

Now, some companies will consider this a threat. But frankly, they are wrong. If you take initiative, you create opportunities. You get ahead instead of trailing the pack. First of all, transparency about data creates trust. And trust boosts business. Secondly, when companies provide their data directly, they can be sure that the data on the market are correct. Companies that continue to shield and hesitate and stay aloof make the wrong choice. Because eventually the data will see the light of day. The huge multinational internet companies will put this data on the market sooner or later. And they will not wait for an agreement or cooperation.

 

Ladies and gentlemen, 

Big data is indeed big business. But it also means big responsibility. While having the new gold in your hands, you should think twice about what you use it for. Eleanor Roosevelt once said that “if you need to handle yourself, you should use your head; but if you have to handle others, use your heart.” 

I hope hearts and minds will work together when we develop the new extraordinary. Because that is what big data is. Now that digital has become the new normal, big data is the next leap.  

For now, I wish all of you a productive and fruitful conference. And together with Vice Prime Minister Peeters, I do look forward to the great innovations that all of you, the bright minds of big data, will create in the years to come. 

I thank you.

 

The question “Are all Data Scientists nerds?” answered thanks to the Data Innovation Survey 2015

This article was originally published here

Although the Data Scientist has been declared the sexiest job of the 21st century by HBR and others, if we are honest, we need to admit that data scientists are still associated with nerds by the mainstream population. This data innovation survey was the perfect opportunity for me to investigate whether data scientists are really as nerdy as many perceive them to be.

I started this article by looking up some background information on nerds (after all, I do consider myself a data scientist). I found a very appropriate description on Wikipedia:

Nerd (adjective: nerdy) is a descriptive term, often used pejoratively, indicating that a person is overly intellectual, obsessive, or socially impaired. They may spend inordinate amounts of time on unpopular, obscure, or non-mainstream activities, which are generally either highly technical or relating to topics of fiction or fantasy, to the exclusion of more mainstream activities. Additionally, many nerds are described as being shy, quirky, and unattractive, and may have difficulty participating in, or even following, sports. Stereotypical nerds are commonly seen as intelligent but socially and physically awkward. Some interests and activities that are likely to be described as nerdy are: Intellectual, academic, or technical hobbies, activities, and pursuits, especially topics related to science, mathematics, engineering and technology.

Does any of this sound familiar to you?

Let’s dive into the results of the data innovation survey, together with my best friend SAS Visual Analytics, to check if these stereotypes are true in the Belgian Data Science Landscape.

Stereotype n°1: All data scientists are young males

It probably doesn’t come as a surprise to you that 87.2% of the respondents are male, but I’m glad to see that 36 other women took the survey along with me. In terms of age, we do find a lot of youngsters, but the categories above 35 seem to be well represented too.

Note to the designer of the survey: next time, please don’t use fixed age categories but let people type their real age if you want to see more interesting graphs than poor pie charts…

 Stereotype n°2: Data scientists are in front of their computer all night

Participants had nine days to respond to the survey. In the bar chart below you can see on which days the 289 respondents submitted the survey. We observe a clear pattern at the beginning of both weeks and, strangely enough, a drop towards Friday the 13th… Maybe data scientists are more superstitious than they would like to admit?

Even more interesting to analyze are the times of the day when people took the survey. To my big surprise there’s a peak in the morning, so the Belgian data scientists seem to be early birds!

As we received the start time and the end time, I also calculated how long the average data scientist took to complete the questionnaire: 12.66 minutes, but the median data scientist had the job done in 10 minutes. We all remember our first statistics class: when the median is not equal to the mean, the distribution is not symmetric…
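
To make the mean-versus-median remark a bit more tangible, here is a minimal Python sketch. The completion times are made up for illustration (they are not the survey data); they only mimic the pattern of many quick respondents and a few slow ones.

```python
from statistics import mean, median

# Hypothetical completion times in minutes (NOT the survey data):
# most respondents finish quickly, one takes much longer.
durations = [6, 7, 8, 9, 10, 10, 11, 12, 14, 40]

avg = mean(durations)
mid = median(durations)

# A mean that sits above the median points to a long right tail.
right_skewed = avg > mid
print(f"mean={avg:.2f} median={mid} right_skewed={right_skewed}")
```

A single slow respondent is enough to pull the mean away from the median, which is why the median is often the safer summary for durations.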

Stereotype n°3: Data scientists are disconnected from the real world

If all data scientists are actually nerds, then they should all be quite “unworldly”. According to the Belgian Data Science survey, almost one third work for a business organization or NGO with 7 777 employees worldwide on average. That doesn’t sound that nerdy to me…

In total, 42% of the Belgian data scientists who took the survey are employed in the IT and technology industry. Ok, what else did you expect?

If data scientists were really as socially inadequate as some bad influences would have us believe, they would never make it to a management position in their organization. And look, almost 55% of our respondents have management responsibilities to a certain extent.

Stereotype n°4: All Data scientists hold a PhD in science or mathematics

Wrong again! Only 18.3% of the Belgian Data Scientists hold a PhD degree. Although the majority graduated in science & math, ICT or engineering, a significant number completed commerce or social studies.

Stereotype n°5: All Data scientists are programming geeks and only use non-mainstream techniques

In part 6 of the survey, participants were asked to rate their skills with a score between 1 (don’t know this technique) and 5 (I’m a guru). It turns out that data scientists are not all gurus in the newer techniques like big data and machine learning but are mostly familiar with traditional techniques like data manipulation (regexes, Python, R, SAS, web scraping) and structured data (RDBMS, SQL, JSON, XML, ETL).

Although we observe some quite high correlations (between math & optimization 0.73, big data & unstructured data 0.67, …), this doesn’t necessarily mean that the scores are high on these topics. This is clearly illustrated with the heat maps below. On the left we have math and optimization, which are highly correlated but with low scores, and on the right there is data manipulation and structured data, with a moderate correlation of 0.42 but with the highest scores.
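
The point that correlation says nothing about the level of the scores can be shown with a small Python sketch. The ratings below are invented for illustration and are not the survey answers; the `pearson` helper is just a plain implementation of the Pearson correlation coefficient.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical self-ratings on the 1-5 scale (NOT the survey data).
math_scores = [1, 1, 2, 2, 3, 1, 2, 1]
optimization_scores = [1, 1, 2, 3, 3, 1, 2, 1]
manipulation_scores = [4, 5, 4, 5, 3, 5, 4, 5]
structured_scores = [5, 4, 4, 5, 4, 4, 5, 5]

r_low_level = pearson(math_scores, optimization_scores)
r_high_level = pearson(manipulation_scores, structured_scores)

# Low average scores can still move together strongly, and vice versa.
print(f"math/optimization: r={r_low_level:.2f}, mean={mean(math_scores):.2f}")
print(f"manipulation/structured: r={r_high_level:.2f}, mean={mean(manipulation_scores):.2f}")
```

In this toy data the two low-scoring skills are the strongly correlated pair, while the two high-scoring skills barely move together.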

Stereotype n°6: All Data scientists are socially isolated and afraid to appear in public

The Belgian Data Scientists don’t only attend the monthly meetups to learn about new developments in Data Science or to hear what’s happening on the Belgian Data Science scene; many of them also state social and networking reasons as motivation to get away from their PC to attend these meetings.

Stereotype n°7: There are clear role models for data scientists, they all look up to the same persons

Not that many respondents seem to be influenced by other data scientists in this world, as only a few of them answered this question with the name of a fellow data scientist, and mostly different ones. For Belgium, on the other hand, we do find two names that each appeared eight times among the answers. Congratulations to Bart Baesens and Philippe Van Impe, the Belgian Data Science gurus!

Conclusion

The conclusion of the analysis of the Data Innovation Survey is as straightforward as it is simple: Data Scientist is the sexiest job of the 21st century! Unfortunately I’ll have to finish off here as my pole dancing class is going to start…

Thank you for making the Data Innovation Summit a success

Thank you all for your engagement and active participation in our Data Innovation Summit.

Yesterday 68 presentations were delivered on time allowing over 400 data lovers to have enough time to network and share ideas with their peers.

This would not have been possible without a professional team of volunteers, a team of friends making the craziest schedules possible.

I would like to thank AXA and our sponsors for supporting us.

The speakers were amazing: tortured into accepting the most horrible presentation format, called Ignite, they delivered it with so much grace and passion. Beautiful.

What a pity that we could not schedule all presentations and that we had to turn down so many participants because the event was sold out. We will handle this differently next time.

Together we reached our first milestone yesterday. It is time to wake up now and work together to accelerate the development of new projects that will position us better in this merciless digitalization race. Let’s bundle our energy and put Belgium back on the map of most innovative countries, the place where it is so easy to start up a company.

Your turn now, please give us your feedback about our Summit: https://nl.surveymonkey.com/s/Happy-DIS2015

The pictures taken at the summit will be available on https://www.facebook.com/Datasciencebe.

The presentation videos will also soon be available on https://www.parleys.com/channel/datascience

Have you seen the analysis of the results (over 600) of the survey made by Ward, Dieter & Nicholas, Nele and Rik? I’m looking forward to these presentations during the finals on April 16th.

Thanks again,

Philippe Van Impe

Data Innovation Survey results – In Neo4j

(reblogging from post originally published over here)

Today, I have had loads of fun at the Data Innovation Summit in Brussels, Belgium. Hosted in the beautiful Axa Belgium offices, it was a great opportunity to meet 500 (!!) data-minded professionals. I was also able to do an Ignite Talk there, which was quite an experience. 15 seconds for every slide, and no way for you to change the slides yourself and determine the “rhythm”: very different. Here are the slides:


But that was not the coolest thing. They also did a “Data Innovation Survey”, which was super cool. The data is all open (find it in this gist), and I of course took it from Excel, created a graph MODEL out of it, and then loaded it into Neo4j using this load script. You will need to tweak the load csv file locations, but after that: just download Neo4j 2.2, fire up the Neo4j-shell, and paste all the commands into it. It should be a matter of half a minute to load the data.
Then we have the data in Neo4j, and we can start doing some queries. Now, I must admit that I am not a huge fan of working the data this way, as there are very few intricate relationships that we can use meaningfully. Nevertheless, here are a few queries:
 //respondents and techniques with PhDs  
 MATCH (dl:DegreeLevel {name:"PhD"})--(r:Respondent)--(t:Technique)  
 return dl,r,t  
That’s easy:

Let’s make it a bit more sophisticated:

 //respondents and techniques at level 5 with PhDs and their DegreeFields  
 MATCH (dl:DegreeLevel {name:"PhD"})--(r:Respondent)-[ht:HAS_TECHNIQUE {level:'5'}]--(t:Techniques),  
 (r)--(df:DegreeField)  
 return dl,r,t,df  
 limit 10  
You can see how that would make the visualisation a bit more complicated.
And then finally, here is a first attempt at doing something a bit more “graphy”. Let’s see which “DegreeFields” are the most important in our graph. In other words – the most “Between” the other nodes of the graph. We do that with a query like this:
 //betweenness centrality of the "DegreeFields"  
 MATCH p=allShortestPaths((r1:Respondent)-[*]-(r2:Respondent))  
  WHERE id(r1) < id(r2) and length(p) > 1  
  UNWIND nodes(p)[1..-1] as n  
  WITH n, count(*) as betweenness, labels(n) as labels  
  WHERE "DegreeField" in labels  
  RETURN n.name, betweenness  
  order by betweenness desc;  
and then we see this result:
There’s a lot of importance to Science/Mathematics, ICT and Engineering. Who would have thought?
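
For readers without a Neo4j instance at hand, the counting idea behind the query above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy graph, the node names and the `all_shortest_paths` helper are hypothetical and have nothing to do with the survey data or with Neo4j internals. Like the query, it counts how often a degree field sits strictly between two respondents on a shortest path.

```python
from collections import Counter, deque
from itertools import combinations

# Hypothetical toy graph (NOT the survey data): respondents linked to degree fields.
edges = [
    ("r1", "Science"), ("r2", "Science"),
    ("r3", "Commerce"), ("r4", "Science"), ("r4", "Commerce"),
]
respondents = ["r1", "r2", "r3", "r4"]
degree_fields = {"Science", "Commerce"}

# Undirected adjacency sets.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def all_shortest_paths(source, target):
    """Return every shortest path between two nodes (breadth-first search)."""
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                parents[nxt].append(node)  # another equally short route
    if target not in dist:
        return []
    paths = []

    def walk(node, tail):
        if node == source:
            paths.append([source] + tail)
            return
        for parent in parents[node]:
            walk(parent, [node] + tail)

    walk(target, [])
    return paths

# Count degree fields that appear strictly inside a shortest path.
betweenness = Counter()
for r1, r2 in combinations(respondents, 2):
    for path in all_shortest_paths(r1, r2):
        for node in path[1:-1]:
            if node in degree_fields:
                betweenness[node] += 1

print(betweenness.most_common())
```

Note that this is a raw path count, as in the query, rather than the normalised betweenness centrality found in textbooks.
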
You can of course apply these techniques much more generically to other problems, and that is mostly why I share it here. I hope others find it interesting, and as always…
… Feedback welcome!
Cheers
Rik

DIS2015 – Thank You Mr Carette

Regie

Normally you would find Mathieu on stage explaining how he helped MSF to get more value out of their data, or clarifying the most efficient method to identify the influencers of a community using Twitter feeds, but not at the Data Innovation Summit. This year Mr Carette (because from now on this is what we should call him) took it upon himself to manage the whole technical side of the summit and to manage and time all the presentations.

Yesterday 68 presentations were delivered on time allowing over 400 data lovers to have enough time to network and share ideas with their peers.

Thank you, Mr Carette, and many happy returns of the day!

Philippe

Please give us your feedback on the Data Innovation Summit 2015

As we write this, the Data Innovation Summit 2015 is in full swing at the AXA building in Brussels. We hope you enjoy the show. We not only hope you enjoy the show, we also want to know how much you enjoy it, which parts you like most, which formats of presentations you find best, and a couple of other things. In order to find out, we organised a small survey (don’t panic, it is quite a bit shorter than the previous one!). Please take a couple of minutes and visit https://nl.surveymonkey.com/s/Happy-DIS2015, and tell us what you think.

And of course, comments to the organisers can also be given in person. Your feedback will be invaluable to make the next event even better.

Enjoy the rest of the day.

Philippe Van Impe and Edward Vanden Berghe

Data Innovation Survey 2015 – preliminary analysis

The results from the Data Innovation Survey are in – and offer a good opportunity to play with a nice data set. What follows is a very preliminary analysis of the data.

accumulation

As you’ll note from the above graph, time was very short – both for the people filling the survey, and for the people trying to do an analysis before the big event of the Data Innovation Summit 2015.

The most interesting parts of the survey were Question 11 and Question 13. The idea with these questions was to create a picture of what kind of data scientists are active in Belgium, and what they perceived to be their most important skills. Both questions resulted in a multivariate data set. Only the first, the replies to Question 11, is analysed here. The results of the other questions, including Question 13, will be analysed in time for the Finals of the competition.

Question 11 asked survey participants to score their skills on a scale from 1 to 5, where 1 was ‘low’, meaning no familiarity with the skill, and 5 meant that the respondent considered himself an expert. The list of 22 skills was taken from ‘Analyzing the Analyzers’, a publication by Harlan Harris, Sean Murphy and Marck Vaisman, available for free from O’Reilly.

In order to compensate for differential scoring of different respondents, the raw scores were replaced by the ranks each of the skills was given by that respondent. This has the net effect of standardising the data, with a constant sum for each row/respondent, corresponding to the sum of the ranks of the 22 skills.
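
As a rough illustration of this rank transformation, here is a small Python sketch. The two respondents are invented, and the handling of ties (tied scores share the mean of their ranks) is an assumption on my side, chosen because it keeps the row sum constant; the function name `average_ranks` is likewise hypothetical.

```python
def average_ranks(scores):
    """Rank scores from 1 (lowest) to n (highest); tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the group
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

# Two hypothetical respondents (NOT survey data) rating 22 skills:
# one scores generously, the other strictly.
generous = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 3, 5, 4, 4, 5, 4, 3, 4, 5, 4, 4]
strict = [2, 1, 3, 1, 1, 2, 1, 1, 3, 1, 2, 1, 2, 1, 1, 3, 1, 1, 2, 2, 1, 1]

for row in (generous, strict):
    # Both rows sum to 22 * 23 / 2 = 253 after ranking.
    print(sum(row), sum(average_ranks(row)))
```

Whatever the raw level of a respondent's scores, the ranked row always sums to n(n+1)/2, which is 253 for 22 skills.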

boxplot skillscores

In the figure above, a box-and-whisker plot is shown for the standardised scores on each of the 22 skills. The heavy line in the centre of the box corresponds to the median; the box itself is drawn between the low and high quartile. The whiskers extend from the quartile, for a maximal distance of 1.5 times the interquartile range.
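
The quantities behind such a plot can be computed with a few lines of Python. This is a minimal sketch on made-up numbers, and it assumes one common convention: quartiles from `statistics.quantiles` with the "inclusive" method, and whiskers that stop at the most extreme data point within 1.5 times the interquartile range. Plotting packages differ slightly in how they compute quartiles, so the figure above may not match this to the decimal.

```python
from statistics import quantiles

def box_stats(values):
    """Median, quartiles, whisker ends and outliers for a box-and-whisker plot."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),    # whiskers end at actual data points
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical scores (NOT the survey data) with one extreme value.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```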

skills ordered

The box-and-whisker plot clearly demonstrates what respondents thought were their most important skills. The same information is summarised in the plot above, displaying a bar plot, ordered according to the means of the standardised skill scores; the mean was standardised to scale to a theoretical maximum of 1.

skills dendrogram

The skills are obviously not unrelated, as they reflect the interests and the talents of the respondent. In the dendrogram displayed above, the relationships between these skills are shown, based on how they co-vary within respondents. Not surprisingly, ‘Business Skills’ and ‘Product development’ seem to be close together, as are ‘Big Data’ and ‘NoSQL’.

heatmap skillscores

Correlations between standardised skill scores are illustrated in this heat map. Rows and columns correspond with specific skills; since the set of skills is the same along rows and columns, the diagonal shows up as dark blue, corresponding with a correlation of 1. Lighter blue off-diagonal cells correspond with pairs of skills that tend to be co-represented in individuals, such as the examples mentioned above (business skills and product development; Big Data and NoSQL). Brown colours correspond with pairs of skills that are rarely present in a single respondent (such as Product development and Machine Learning; Statistics and Back-end Programming).

pca skillscores

Another method of looking at the correlational structure of a data set is through a Principal Component Analysis. Part of the output of a PCA, a ‘biplot’, is shown above. In a PCA, the original data are reprojected to move as much of the variation present in the data into as few dimensions as possible. These new dimensions are the ‘Principal Components’. In the figure above, a plot is presented of the first two Principal Components. Each of the dots represents a single respondent, positioned according to that respondent’s Principal Component scores. The red arrows illustrate the correlation of the original (standardised skill score) variables with the Principal Components (the PC loadings), and with each other. Here again we see that Business skills and Product Development vary together, and are ‘orthogonal’ to more technical skills such as SQL and system administration.

scree skillscores

The first two Principal Components, as displayed in the biplot above, only capture part of the variance in the data. In a scree plot, the variance captured by each of the Principal Components is displayed. Clearly, and as intended by a PCA, the first few PC axes capture large fractions of the variance, but by no means all of it. It will be necessary to plot more than just the first two PC scores against each other, to get a better idea of the structure of the data.
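
What a scree plot displays can be illustrated with a deliberately tiny Python sketch. It uses only two made-up variables (not the 22 survey skills), because for a 2x2 covariance matrix the eigenvalues, which are the variances captured by the Principal Components, have a simple closed form. The variable names and numbers are assumptions for illustration only.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical standardised scores for two skills (NOT the survey data).
business = [1, 2, 3, 4, 5]
product = [2, 1, 4, 3, 5]

def covariance(xs, ys):
    """Sample covariance of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

a, c = variance(business), variance(product)
b = covariance(business, product)

# Eigenvalues of the covariance matrix [[a, b], [b, c]]:
# each one is the variance captured by one Principal Component.
centre = (a + c) / 2
spread = sqrt(((a - c) / 2) ** 2 + b ** 2)
eigenvalues = [centre + spread, centre - spread]

# Fractions of the total variance; these are the bar heights of a scree plot.
explained = [ev / (a + c) for ev in eigenvalues]
print(explained)
```

With 22 variables the same idea applies, only the eigenvalues then come from a numerical routine rather than a formula, and the scree plot shows 22 bars instead of two.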

Data Innovation Summit Dashboard

Very nice blogpost from Dieter De Witte & Nicholas Ocket about the analysis of the results of our Survey.

hippomongous

Motivation:

Often the first step in analyzing a dataset is thinking of different ways of visualizing your raw data: exploratory data visualizations. In general these visualizations are only used by the data scientist for personal use. After deriving some insights, these insights can be communicated using explanatory data visualizations. There is actually a problem with this approach: data scientists are generally not good at all aspects of Data Science. Some people are experts in Machine Learning, some are code gurus, but in general the third collection (figure on the right), being the domain expertise, is somewhat ignored. Can one derive insights from data without extensive domain knowledge?

In this context the transformation from exploratory to explanatory data analysis is problematic since this transformation is performed by the data scientist and during this transformation information gets lost.

Therefore a third approach is feasible. Can exploratory data visualizations be an…

What ? DIS2015 is sold out ? How is this possible for a free event ?

soldout1

Dear friends, dear data science experts,

I’m so proud that our first Data Innovation Summit is sold out.

Experts who took the survey will receive a confirmation tonight and are invited to take the challenge: present their own analysis of the results of the survey at our meetup on April 16th.

If you did not manage to get your free access pass, no worries, all the presentations will be made available on our private channel.

I’m looking forward to meeting most of you on Thursday.

Let’s build this datascience knowledge hub together.

Philippe

Ready for the Data Innovation Summit, Survey & Challenge ?

DIS2015

Your community is holding its first full day summit this week on Thursday in Brussels.

We have received so much support from all parts of Belgium. I’m so proud of all our actors contributing to this amazing agenda. It will be a fast paced day with over 50 presentations and over 30 exhibitors. You will see and meet:

  • Most important Universities of Belgium
  • Major corporations will explain how they prepare to become data driven
  • Political powers represented by Kris Peeters
  • Over 20 start-ups
  • Over 480 data experts
  • Over 30 exhibitors

Here are some highlights of the day:

  • The presentation of Kris Peeters & Vincent Blondel
  • The presentations of Reservoir Lab from UGent, $100k winner of the international Kaggle competition.
  • The launch of the European Data Innovation Hub.

The challenge

If you have already registered and are on the waiting list, take the survey that will lead you to your ticket directly. Once you have answered this short poll we would like to challenge you and your team to analyze this data and to prepare a presentation for our finals.

About the summit:

  • We open our doors at 07:30. The Axa building is situated at Boulevard du Souverain 25, 1170 Bruxelles.
  • A parking space has been reserved for you. Use the Tenreukenlaan to proceed to the visitors’ parking of Axa.
  • The venue is a 10-minute walk from metro Hermann Debroux.
  • All presentations will be recorded and made available on our website.
  • Here is the list of participants.
  • Press Coverage: We have issued a press release

I’m looking forward to meeting you on Thursday.

Philippe Van Impe
