Brussels loves AI

Why Brussels is the best place to open a European HQ for tech companies.

A recent report from EY shows that 2018 was an absolute record year for attracting foreign direct investment (FDI). A total of 278 foreign investment projects were initiated, a 29% increase on the previous year, creating an all-time high number of jobs (7,363). With this increase, Belgium is going against the European trend. More than 1,500 international companies have already established their HQ in Brussels. The capital of Europe has everything a company needs: an open economy, business incentives and services, strong infrastructure, and a talent pool of multilingual, highly skilled professionals in all sectors. As the world's 2nd most cosmopolitan city, located at the crossroads of Western European cultures, this is actually not very surprising.

Belgium also has essential assets regarding Artificial Intelligence (AI).
It has a tradition of world-class AI researchers, some of whom were present at the cradle of artificial intelligence. Additionally, Belgium ranks 9th in the EU's “Digital Economy and Society Index”, with a society that largely embraces AI as a novel and promising technology rather than a threat.

The Brussels Region has had a pioneering role in AI since long before it became a buzzword, hosting, for instance, the first academic laboratory dedicated to AI in Europe at the University of Brussels. The most active research groups are the IRIDIA laboratory, the AI Experience Centre and the Machine Learning Group.

Hugues Bersini

Prof Dr Hugues Bersini, founder of the AI lab of the ULB


Ingrid Daubechies

Prof Dr Ingrid Daubechies, the most cited Belgian mathematician, recognized for her study of the mathematical methods that enhance image-compression technology


Belgian employees are among the leaders in the use of artificial intelligence in the workplace. 54% of employees already encounter artificial intelligence at work or expect to within the next two years. Belgians are also predominantly positive about the use of AI at work: more than six in ten say that AI has a positive effect on their productivity.

The Brussels Region also supports companies in their AI endeavours, from an innovation, technical, financial and business point of view.

  1. Innoviris, the Brussels Region funding body, has been providing significant support to AI-related research and innovation, steadily ramping up its support programs in this field, with a dedicated budget of €20M over the past 2 years. In particular, an AI call (“Team Up”) aimed at fostering collaboration between academia and industry was launched in 2017. This program reflects the Brussels Region's approach to AI development, which emphasises collaborative research and open innovation.
  2. For the technical aspects, one can mention the leading role of Sirris Brussels, through its Elucidata laboratory, and of the Icity.Brussels technology hub, both co-funded by the Brussels Region and the ERDF.
  3. The group SRIB/GIMB aims at providing financial support for the creation, reorganisation or expansion of private companies located in the region. In addition to investment, they support entrepreneurs in all phases of their company’s development. BRUSTART provides financial solutions to young innovative companies that are in the launching or starting-up phase. It supports technological innovation through the seed-capital fund, which is directed towards spin-offs.
  4. As for the business aspects, in 2017 the Brussels-Capital Region launched a plan that coordinates and gathers the main entities involved in AI in the Region. Moreover, its cluster provides tailor-made support and advice.


Another important aspect of Brussels is its support for the tech communities. The Region expects to continue building on the success of the Data Science Community, made up of AI experts and practitioners. This community counts several thousand members and organizes more than a hundred events per year, as well as the European Data Innovation Summit.
The Region has co-financed a dedicated tech and starters' hub called DigitYser, where communities can gather to boost their digital skills and raise awareness of AI and data literacy.

DigitYser is also the regional driver of AI4Belgium, the community-led approach to enable Belgian people and organisations to capture the opportunities of AI while facilitating the ongoing transition responsibly. AI4Belgium has the ambition to position Belgium in the European AI landscape. We are currently assisting the newly elected government in defining its AI strategy based on the European recommendations.

The Brussels AI strategy is based on 6 pillars:

  • Provide data literacy and AI training for all
  • Support R&D and promote innovation
  • Set up a responsible data strategy and promote open data
  • Promote the use of AI in private industry & SMEs
  • Promote the use of AI in the public sector
  • Position Brussels as the driving AI city of Europe

Skills development is further enabled by new training programs, namely the Microsoft AI School, Big Data Bxl and the AI Black Belt initiative.

Three tech hubs are actively promoting AI:

  • Becentral – The digital campus of Brussels
  • Betacowork – Coworking and Community for freelancers & entrepreneurs.
  • DigitYser – The home of the AI & data science community in Brussels

Brussels and Antwerp are among the top 15 hubs of artificial intelligence talent in Europe, according to a report from the international venture capital firm Atomico. That is one of the conclusions of ‘The State of European Tech’, a report from venture firm Atomico and the Finnish tech conference Slush published in 2016. Atomico and Slush use LinkedIn data as the source for their conclusions; to be more precise, they rank cities by the number of LinkedIn members with AI skills.

Our final argument is that Brussels is also a very pleasant place to live and a family-friendly environment blessed with a welcoming heart. A business city teeming with opportunities, but also a relaxing home, ideal for enjoying life's sweeter side. Lying at the crossroads of Western Europe's cultures while being home to so many citizens from all over the world definitely makes Brussels a human-sized global city where local culture and cosmopolitanism get along perfectly.

So, in summary, here are the top 7 reasons to put Brussels on your shortlist:

  1. Located in the centre of Europe
  2. Low cost of living and high quality of life
  3. Available AI talent at an acceptable price
  4. Top universities and innovation programs
  5. Practical Regional & Federal support
  6. Very active AI communities and expert groups
  7. Tech & starters' hubs to host and support your new venture

Looking at all this, it is not surprising that Brussels has become the best place to open a European HQ for a tech company.


About the author: 

Philippe Van Impe is the driver of the AI & data science community of Belgium, the founder of DigitYser – the tech & starters' hub of Brussels – and a business coach for companies that want to set up their European HQ in Brussels.



Datascience meetup on Legaltech


A total of 175 people registered for the S05E03 meetup of the #datascience community of Brussels. Three legaltech companies demonstrated how they are using data science to offer better legal services.

Our speakers explained that the data in the legal industry consists mainly of words, and walked us through their approach to processing vast amounts of unstructured data with Natural Language Processing (NLP) techniques.

The event was held in the clubhouse of the AI & data science community at DigitYser on February 7th, 2019. As always, it started with a happy hour at 18:00. All presentations were recorded by Ricardo, head of the technical team of DigitYser, and made available on our different channels: YouTube, Twitter and Facebook.


Presentation 1 started at 19:00:

How Reacfin proposed using NLP and deep learning to check the conformity of legal and contractual documents in the reinsurance business.

Reacfin walked us through a use case in the reinsurance business: using NLP to automate the review of huge contracts. Aurélien Couloumy, head of data science, led us through their step-by-step approach to tackling this issue using spaCy.


Presentation 2 started at 19:25:

How Jetpack.AI won the #hackBXLaw in November 2018.

The winners of the hackathon on debt recovery explained the problem-solving approach that led them to win the contest and develop their tool, Magpie.

Dodo Chochitaichvili and Gautier Krings explained how lawyers and data scientists worked together using an agile methodology and KISS principles, where the first iteration was the production of a skateboard.


Presentation 3 started at 19:50:

Darts-ip explained how AI combined with high-quality data is revolutionising Intellectual Property practices.

The managing director of Darts-ip, Evrard Van Zuylen, and data scientist Vignesh Baskaran, aka Vicky, introduced the technology they develop to create a comprehensive database on patents around the globe. They explained the data science process management that led to their competitive advantage.





We are pleased to announce that the call for speakers is out.


Job – DATA SCIENTIST – Koning Boudewijnstichting

Koning Boudewijnstichting
Working together for a better society

The mission of the Koning Boudewijnstichting (King Baudouin Foundation) is to contribute to a better society.

In Belgium and Europe, the Foundation is an actor for change and innovation in the service of the public interest and social cohesion. It strives to maximise its impact by strengthening the capabilities of organisations and individuals. It encourages effective philanthropy by individuals and companies.

Integrity, transparency, pluralism, independence, respect for diversity and the promotion of solidarity are its core values.

Its current fields of action are poverty and social justice, philanthropy, health, civic engagement, talent development, democracy, European integration, heritage and development cooperation.

The Koning Boudewijnstichting holds a wealth of quantitative and qualitative data about its projects. This is an enormous source of information, which in turn can feed into new initiatives.

The Koning Boudewijnstichting is looking for a data scientist.

Two-year contract

– you have experience with big data and structuring data, e.g. Python (pandas, numpy), R, Spark, ….
– you can define a strategy to extract insights from data
– you can consolidate and harmonise data
– you can develop data models and data analyses
– you are interested in societal developments in the Foundation's fields of action
– you are eager to explore the Foundation's wealth of information further, to look for additional data, and to tackle cases in an out-of-the-box way
– you have an entrepreneurial and inspiring attitude, identify opportunities and take initiative
– you work autonomously with clear responsibilities, but are also a team player

You have
– a master's degree in informatics, information management, computer science, statistics or similar
– a particular affinity with the Foundation's themes
– excellent interpersonal skills
– strong oral and writing skills
– a healthy dose of emotional and social intelligence
– language skills: Dutch as mother tongue and a very good knowledge of French and English
– knowledge of the standard software packages Word, Outlook, Excel, PowerPoint and social media

We offer

– a full-time fixed-term contract (two years)
– a substantively fascinating and varied job with plenty of autonomy and contacts
– an open, dynamic and human atmosphere
– an attractive salary with benefits
– a flexible working arrangement (flexible hours, teleworking)

The Koning Boudewijnstichting wants to reflect as closely as possible the society it serves. Candidates are therefore selected on the basis of their qualities and skills, regardless of gender, origin or disability. If you have a disability, let us know in advance; we will make sure you can take part in the selection process under the best possible conditions.

More jobs? Then come to the JobFair on 28 September.

Register for the JobFair of the Data Science Community, which takes place at DigitYser on 28 September.

The #datascience JobFair will take place at DigitYser on September 28th – come and meet 20 recruiting companies and over 70 candidates.
Companies can book their place by sending an e-mail to
Jobseekers can book their place through EventBrite:

Interested in applying directly?

Send your letter + CV by 30 September 2017 at the latest to Mrs Pascale Criekemans, Brederodestraat 21, 1000 Brussels, or by e-mail:
For more info, you can call 02/549 02 14 or

Executive Workshop on Leadership and Digital Transformation


DiS17_Speaker_Stephen Brobst

Every industry will be transformed by the new business models and revolutionary possibilities created by the digital society in the 21st century.

This talk will address the critical success factors that leaders must embrace when transforming an enterprise into a player for the digital age. 

We will discuss the importance of data-driven business models to transform from traditional customer relationship management to customer experience management in the new world of digitally enabled customers.

We will present a framework that will emphasize the importance of real-time recommendation engines leveraging operational intelligence using self-learning algorithms with techniques drawn from the world of artificial intelligence and machine learning.

We will also propose an approach for monetizing data as a critical success factor for all enterprises who want to be successful in the digital age.  Issues such as enabling cultural change, organizational skill set requirements, and governance will also be discussed.  Case study examples of organizations that have been successful in executing digital transformation will be used throughout the presentation.

About Stephen:

Stephen Brobst is the Chief Technology Officer for Teradata Corporation. Stephen performed his graduate work in Computer Science at the Massachusetts Institute of Technology, where his master's and PhD research focused on high-performance parallel processing. He also completed an MBA with joint course and thesis work at the Harvard Business School and the MIT Sloan School of Management. Stephen is a TDWI Fellow and has been on the faculty of The Data Warehousing Institute since 1996. During Barack Obama's first term he was also appointed to the Presidential Council of Advisors on Science and Technology (PCAST) in the working group on Networking and Information Technology Research and Development (NITRD). He was recently ranked by ExecRank as the #4 CTO in the United States (behind the CTOs from Tesla Motors and Intel) out of a pool of 10,000+ CTOs.

Takeaways from his presentation:

  • Learn about the critical success factors for competing effectively in the digital world.
  • Learn about the cultural and organizational skill set requirements for transforming into a digital business.
  • Learn how to use effective governance to transform from a traditional business model to a digitally enabled business model while sustaining profitable operations.


  • 30 March 2017 14:45 during the Data Innovation Summit
  • DISUMMIT @ING Marnix, Troonstraat 1, 1000 Brussel

Stephen’s previous presentation in Belgium:

Stephen gave an executive session at the Hub in the summer of 2016, where he shared his views on the importance of open data, open source, analytics in the cloud and data science. Over 100 executives left the workshop that day inspired and armed with actionable ideas that helped them define a profitable strategy for their data science teams.

Here is the link to his presentation:

Who might be interested:

  • Executives and directors involved in digital transformation projects
  • Change and innovation managers
  • Specialists focussing on data innovation and management
  • Students aspiring to become one or all of the above

Register your seat now:


Other Executive sessions during di-Summit:

  • 29 March 16:00 – 3h Workshop from Kirk Borne: Communicating Data Literacy and the Value of Data to Clients and Colleagues
  • 30 March 10:45 – 2h Workshop from Geert Verstraeten: Introducing Predictive Analytics 
  • 30 March 13:45 – 1h Workshop from Natalino Busa: Positioning Open Source in your existing software architecture
  • 30 March 17:00 – Closing Keynote from Kirk Borne: A Data-rich World for a Better World: From Sensors to Sense-Making
  • Please check the program for more exciting presentations.


How Tom won his first Kaggle competition

tom wins kaggle

This is a copy of Tom’s original post on Github.

Winning approach of the Facebook V Kaggle competition

The Facebook V: Predicting Check Ins data science competition, where the goal was to predict which place a person would like to check in to, has just ended. I participated with the goal of learning as much as possible, maybe aiming for a top 10% finish since this was my first serious Kaggle competition attempt. I managed to exceed all expectations and finish 1st out of 1212 participants! In this post, I'll explain my approach.


This blog post will cover all sections to go from the raw data to the winning submission. Here’s an overview of the different sections. If you want to skip ahead, just click the section title to go there.

The R source code is available on GitHub. This thread on the Kaggle forum discusses the solution on a higher level and is a good place to start if you participated in the challenge.


Competition banner


From the competition page: The goal of this competition is to predict which place a person would like to check in to. For the purposes of this competition, Facebook created an artificial world consisting of more than 100,000 places located in a 10 km by 10 km square. For a given set of coordinates, your task is to return a ranked list of the most likely places. Data was fabricated to resemble location signals coming from mobile devices, giving you a flavor of what it takes to work with real data complicated by inaccurate and noisy values. Inconsistent and erroneous location data can disrupt experience for services like Facebook Check In.

The training data consists of approximately 29 million observations where the location (x, y), accuracy, and timestamp are given along with the target variable, the check-in location. The test data contains 8.6 million observations where the check-in location should be predicted based on the location, accuracy and timestamp. The train and test data sets are split based on time. There is no concept of a person in this dataset: all the observations are events, not people.

A ranked list of the top three most likely places is expected for all test records. The leaderboard score is calculated using the MAP@3 criterion. Consequently, ranking the actual place as the most likely candidate gets a score of 1, ranking the actual place as the second most likely gets a score of 1/2 and a third rank of the actual place results in a score of 1/3. If the actual place is not in the top three of ranked places, a score of 0 is awarded for that record. The total score is the mean of the observation scores.
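To make the scoring concrete, here is a minimal sketch of the MAP@3 computation (in Python rather than the competition's R code; the function name is my own):

```python
def map_at_3(predictions, actuals):
    """Mean Average Precision at 3: a correct first guess scores 1,
    a correct second guess 1/2, a correct third guess 1/3, and a
    miss scores 0; the final score is the mean over all records."""
    total = 0.0
    for ranked, actual in zip(predictions, actuals):
        for rank, place in enumerate(ranked[:3], start=1):
            if place == actual:
                total += 1.0 / rank
                break
    return total / len(actuals)
```

For example, ranking the true place first for one record and second for another gives (1 + 1/2) / 2 = 0.75.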

Check Ins where each place has a different color


Exploratory analysis

Location analysis of the train check-ins revealed interesting patterns in the variation of x and y. There appears to be far more variation in x than in y. It was suggested that this could be related to the streets of the simulated world. The difference in variation between x and y is, however, different for all places, and there is no obvious spatial (x-y) pattern in this relationship.

It was quickly established by the community that time is measured in minutes and could thus be converted to relative hours and days of the week. This means that the train data covers 546 days and the test data spans 153 days. All places seem to live in independent time zones with clear hourly and daily patterns. No spatial pattern was found with respect to the time patterns. There are however two clear dips in the number of check ins during the train period.

Accuracy was by far the hardest input to interpret. It was expected that it would be clearly correlated with the variation in x and y but the pattern is not as obvious. Halfway through the competition I cracked the code and the details will be discussed in the Feature engineering section.

I wrote an interactive Shiny application to research these interactions for a subset of the places. Feel free to explore the data yourself!

Problem definition

The main difficulty of this problem is the extended number of classes (places). With 8.6 million test records there are about a trillion (10^12) place-observation combinations. Luckily, most of the classes have a very low conditional probability given the data (x, y, time and accuracy). The major strategy on the forum to reduce the complexity consisted of calculating a classifier for many x-y rectangular grids. It makes much sense to make use of the spatial information since this shows the most obvious and strong pattern for the different places. This approach makes the complexity manageable but is likely to lose a significant amount of information since the data is so variable. I decided to model the problem with a single binary classification model in order to avoid ending up with many high-variance models. The lack of any major spatial patterns in the exploratory analysis supports this approach.


Generating a single classifier for all place-observation combinations would be infeasible even with a powerful cluster. My approach consists of a stepwise strategy in which the conditional place probability is only modeled for a set of place candidates. A simplification of the overall strategy is shown below.

High level strategy


The given raw train data is split in two chronological parts, with a similar ratio as the ratio between the train and test data. The summary period contains all given train observations of the first 408 days (minutes 0-587158). The second part of the given train data contains the next 138 days and will be referred to as the train/validation data from now on. The test data spans 153 days as mentioned before.

The summary period is used to generate train and validation features and the given train data is used to generate the same features for the test data.

The three raw data groups (train, validation and test) are first sampled down into batches that are as large as possible but can still be modeled with the available memory. I ended up using batches of approximately 30,000 observations on a 48GB workstation. The sampling process is fully random and results in train/validation batches that span the entire 138 days’ train range.

Next, a set of models is built to reduce the number of candidates from 100 to 20, using 15 XGBoost models in the second candidate selection step. The conditional probability P(place_match|features) is modeled for all ~30,000*100 place-observation combinations, and the mean predicted probability of the 15 models is used to select the top 20 candidates for each observation. These models use features that combine place and observation measures of the summary period.

The same features are used to generate the first level learners. Each of the 100 first level learners is again an XGBoost model, built using ~30,000*20 feature-place_match pairs. The predicted probabilities P(place_match|features) are used as features of the second level learners, along with 21 manually selected features. The candidates are ordered using the mean predicted probabilities of the 30 second level XGBoost learners.

All models are built using different train batches. Local validation is used to tune the model hyperparameters.
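The "average the ensemble's predicted probabilities and keep the best candidates" step that appears in both candidate selection and the final ranking can be sketched as follows (a minimal Python/NumPy illustration, not the original R code; the array shapes are assumptions):

```python
import numpy as np

def top_candidates(prob_matrices, n_keep=20):
    """Average P(place_match|features) over an ensemble of models and
    keep the indices of the n_keep highest-scoring candidates per
    observation.

    prob_matrices: array of shape (n_models, n_obs, n_candidates).
    Returns an (n_obs, n_keep) array of candidate indices."""
    mean_prob = np.mean(prob_matrices, axis=0)
    order = np.argsort(-mean_prob, axis=1)  # descending mean probability
    return order[:, :n_keep]
```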

Candidate selection 1

The first candidate selection step reduces the number of potential classes from >100K to 100 by considering nearest neighbors of the observations. I considered the neighbor counts of the 2500 nearest neighbors where y variations are 2.5 times more important than x variations. Ties in the neighbor counts are resolved by the mean time difference since the observations. Resolving ties with the mean time difference is motivated by the shifts in popularity of the places.

The nearest neighbor counts are calculated efficiently by splitting up the data into overlapping rectangular grids. Grids are created as small as possible while still guaranteeing that the 2500 nearest neighbors fall within the grid in the worst-case scenario. The R code is suboptimal due to the use of several loops, but the major bottleneck (ordering the distances) was reduced by a custom Rcpp package, which resulted in an approximate 50% speed-up. Improving the logic further was not a major priority since the features were calculated in the background.
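A brute-force sketch of the weighted neighbor counting (in Python rather than the gridded Rcpp implementation, for a single query point, and without the tie-breaking on mean time difference):

```python
import numpy as np

def candidate_counts(train_xy, train_place, query_xy, k=2500, y_weight=2.5):
    """Count how often each place occurs among the k nearest neighbors
    of a query point, with y variations weighted y_weight (here 2.5)
    times more heavily than x variations."""
    dx = train_xy[:, 0] - query_xy[0]
    dy = (train_xy[:, 1] - query_xy[1]) * y_weight
    nearest = np.argsort(dx ** 2 + dy ** 2)[:k]  # k smallest weighted distances
    places, counts = np.unique(train_place[nearest], return_counts=True)
    return dict(zip(places.tolist(), counts.tolist()))
```

The real implementation avoids the full distance sort by restricting the search to the overlapping rectangular grids described above.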

Feature engineering

Feature engineering strategy

Three weeks into the eight-week competition, I climbed to the top of the public leaderboard with about 50 features. From then on I kept thinking of new features to capture the underlying patterns of the data. I also added features similar to the most important features in order to capture the subtler patterns. The final model contains 430 numeric features, and this section is intended to discuss the majority of them.

There are two types of features. The first category relates to features that are calculated using only the summary data such as the number of historical check ins. The second and largest category combines summary data of the place candidates with the observation data. One such example is the historical density of a place candidate, one year prior to the observation.

All features are rescaled if needed in order to result in similar interpretations for the train and test features.


The major share of my 430 features is based on nearest-neighbor-related features. The neighbor counts for different Ks (1, 5, 10, 20, 50, 100, 250, 500, 1000 and 2500) and different x-y ratio constants (1, 2.5, 4, 5.5, 7, 12 and 30) resulted in 10*7 features. For example: if a test observation has 3 of its 5 nearest neighbors in class A and 2 of its 5 nearest neighbors in class B, candidate A will contain the numeric value 3 for the K=5 feature, candidate B will contain the numeric value 2 for the K=5 feature, and all other 18 candidates will contain the value 0 for that feature. The mean time difference between a candidate and all 70 combinations resulted in 70 additional features. 10 more features were added by considering the distance between the Kth nearest neighbor and the observations for a ratio constant of 2.5. These features are an indication of the spatial density. 40 more features were added in a later iteration around the most significant nearest neighbor features: K was set at (35, 75, 100, 175, 375) for x-y ratio constants (0.4, 0.8, 1.3, 2, 3.2, 4.5, 6 and 8). The distances of all 40 combinations to the most distant neighbor were also added as features. Distance features are divided by the number of summary observations in order to have similar interpretations for the train and test features.

I further added several features that consider the (smoothed) spatial grid densities. Other location features relate to the place summaries, such as the median absolute deviations and standard deviations in x and y. The ratio between the median absolute deviations was added as well. Features were relaxed using additive (Laplace) smoothing whenever it made sense, using relaxation constants of 20 and 300. Consequently, the relaxed mad for a place with 300 summary observations equals the mean of the place mad and the weighted place population mad for a relaxation constant of 300.
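The relaxation itself is a one-line weighted average; a minimal Python sketch (the argument names are my own), assuming n_obs observations for the place and relaxation constant c:

```python
def relax(place_stat, n_obs, population_stat, c):
    """Additive (Laplace) smoothing: shrink a per-place statistic toward
    the population statistic. When n_obs == c the result is the simple
    mean of the two, matching the mad example above."""
    return (n_obs * place_stat + c * population_stat) / (n_obs + c)
```

With 300 observations and c = 300, relax(place_mad, 300, population_mad, 300) is exactly the mean of the two statistics; places with few observations are pulled more strongly toward the population value.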


The second largest share of the feature set belongs to time features. Here I converted all time period counts to period density counts in order to handle the two drops in the time frequency. Periods include 27 two-week periods prior to the end of the summary data and 27 one-week periods prior to the end of the summary data. I also included features that look at the two-week densities between 75 and 1 week(s) back from the observations. These features resulted in missing values, but XGBoost is able to handle them. Additional features were added for the clear yearly pattern of some places.

Weekly counts


Hour, day and week features were calculated using the historical densities with and without cyclical smoothing and with or without relaxation. I suspected an interaction between the hour of the day and the day of the week and also added cyclical hour-day features. Features were added for daily 15-minute intervals as well. The cyclical smoothing is applied with Gaussian windows. The windows were chosen such that the smoothed hour, hour-week and 15-minute blocks capture different frequencies.
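Cyclical smoothing with a Gaussian window can be sketched as follows (a Python illustration; the actual window widths were tuned per feature and are not reproduced here):

```python
import numpy as np

def smooth_cyclic(density, sigma):
    """Smooth a cyclic density (e.g. 24 hourly counts) with a Gaussian
    window that wraps around, so hour 23 borrows mass from hour 0."""
    n = len(density)
    offsets = np.arange(n) - n // 2           # window centered on offset 0
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()                    # normalize: total mass preserved
    return np.array([np.dot(np.roll(density, n // 2 - i), kernel)
                     for i in range(n)])
```

Because the window wraps around, a check-in spike at midnight spreads symmetrically into the late evening and early morning instead of being truncated at the array boundary.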

Other time features include extrapolated weekly densities using various time series models (arima, Holt-Winters and exponential smoothing). Further, the time since the end of the summary period was also added as well as the time between the end of the summary period and the last check in.


Understanding accuracy was the result of generating many plots. There is a significant but low correlation between accuracy and the variation in x and y but it is not until accuracy is binned in approximately equal sizes that the signal becomes visible. The signal is more accurate for accuracies in the 45-84 range (GPS data?).

Mean variation from the median in x versus 6 time and 32 accuracy groups


The accuracy distribution seems to be a mixed distribution with three peaks which changes over time. It is likely related to three different mobile connection types (GPS, Wi-Fi or cellular). The places show different accuracy patterns, and features were added to indicate the relative accuracy group densities. The middle accuracy group was set to the 45-84 range. I added relative place densities for 3 and 32 approximately equally sized accuracy bins. It was also discovered that the location is related to the three accuracy groups for many places. This pattern was captured by the addition of extra features for the different accuracy groups. A natural extension of the nearest neighbor calculation would incorporate the accuracy group, but I no longer had time to implement it.

The x-coordinates seem to be related to the accuracy group for places like 8170103882



Tens of z-scores were added to indicate how similar a new observation is to the historical patterns of the place candidates. Robust z-scores ((f − median(f))/mad(f) instead of (f − mean(f))/sd(f)) gave the best results.
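As a sketch, a robust z-score against a place's historical feature values looks like this (a Python illustration, not the original R code):

```python
import numpy as np

def robust_z(history, value):
    """Distance of a new value from the historical median, in units of
    the median absolute deviation (mad) instead of mean/sd, so outliers
    in the history barely affect the score."""
    med = np.median(history)
    mad = np.median(np.abs(history - med))
    return (value - med) / mad
```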

Most important features

Nearest neighbors are the most important features for the studied models. The most significant nearest neighbor features appear around K=100 for distance constant ratios around 2.5. Hourly and daily densities were all found to be very important as well and the highest feature ranks are obtained after smoothing. Relative densities of the three accuracy groups also appear near the top of the most important features. An interesting feature that also appears at the top of the list relates to the daily density 52 weeks prior to the check in. There is a clear yearly pattern which is most obvious for places with the highest daily counts.

Clear yearly pattern for place 5872322184. The green line goes back 52 weeks from the highest daily count


The feature files are about 800MB for each batch and I saved all the features to an external HD.

Candidate selection 2

The features from the previous section are used to fit binary classification models on 15 different train batches using XGBoost. With 100 candidates for each observation this is a slow process, so it made sense to narrow the number of candidates down to 20 at this stage. I did not perform any downsampling in my final approach since all zeros (no match between the candidate and the true place) contain valuable information, and XGBoost handles unbalanced data quite well in my experience. I did consider omitting observations whose true class was not in the top 100, but this resulted in slightly worse validation scores. The reasoning is the same as above: those observations contain valuable information! The 15 candidate selection models are built with the top 142 features, where the feature importance order is obtained from the XGBoost importance ranks of 20 models trained on different batches. Hyperparameters were selected using the local validation batches. The 15 second-stage candidate selection models each generate a predicted probability P(place_match|data); I average those probabilities to select the top 20 candidates in the second candidate selection step.
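The averaging-and-truncation step can be sketched as follows (a minimal illustration; the array layout and function name are my assumptions):

```python
import numpy as np

def select_top_candidates(prob_matrices, k=20):
    """Average P(place_match | data) over the candidate selection models
    and keep the k highest-scoring candidates per observation.

    prob_matrices: list of (n_observations, n_candidates) arrays, one per
    model. Returns candidate column indices, most likely first."""
    mean_prob = np.mean(prob_matrices, axis=0)
    # argsort on the negated probabilities gives a descending order
    return np.argsort(-mean_prob, axis=1)[:, :k]
```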

At this point I also dropped observations that belong to places that only have observations in the train/validation period. This filtering was also applied to the test set.

First level learners

The first level learners are very similar to the second-stage candidate selection models, except that 75 of the 100 models were fit on one fifth of the data; the other 25 models were fit on 100 candidates for each observation. The 100 base XGBoost learners were fit on different random parts of the training period. Deep trees gave me the best results here (depth 11), and the eta constant was set to (11 or 12)/500 for 500 rounds. Column sampling also helped (0.6), and subsampling the observations (0.5) did not hurt while speeding up fitting. I included either all 430 features or a uniform random pick from the importance-ordered features within a target feature count range (100-285 and 180-240). The first level learner framework was created to handle learner types other than XGBoost. I experimented with the nnet and H2O neural network implementations, but those were either too slow in transferring the data (H2O) or too biased (nnet). The way XGBoost handles missing values is another great advantage over the mentioned neural network implementations. The XGBoost models are also quite diverse since they are trained on different random train batches with differing hyperparameters (eta constant, number of features, and the number of considered candidates, either 20 or 100).

Second level learners

The 30 second level learners combine the predictions of the 100 first level models with 21 manually selected features for all top 20 candidates. The 21 additional features are high level features such as the x, y and accuracy values, as well as the time since the end of the summary period. The added value of the 21 features is small but consistent on the validation set and the public leaderboard (~0.02%). The best local validation score was obtained with moderate tree depths (depth 7), and the eta constant was set to 8/200 for 200 rounds. Column sampling also helped (0.6), and subsampling the observations (0.5) did not hurt while again speeding up fitting. The candidates are ordered using the mean predicted probabilities of the 30 second level XGBoost learners.

Analysis of the local MAP@3 indicated better results for accuracies in the 45-84 range. The difference between the local and test validation scores is in large part related to this observation. There seems to be a trend towards the use of devices that show less variation.
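For reference, MAP@3 (the competition metric used for the local validation above) rewards a correct place at position i of the top 3 with 1/i; a minimal sketch:

```python
def map_at_3(predictions, truth):
    """Mean average precision at 3: a correct place at (1-based)
    position i of the top 3 predictions contributes 1/i; a miss
    contributes nothing."""
    total = 0.0
    for preds, actual in zip(predictions, truth):
        for i, p in enumerate(preds[:3], start=1):
            if p == actual:
                total += 1.0 / i
                break
    return total / len(truth)

map_at_3([[1, 2, 3], [4, 5, 6]], [2, 9])  # -> 0.25 (hit at position 2, then a miss)
```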

Local MAP@3 versus accuracy groups


The private leaderboard standing below, used to rank the teams, shows the top 30 teams. It was a very close competition in the end, and Markus would have been a well-deserved winner as well. We were very close to each other ever since the third week of the eight-week contest and pushed each other forward. Because the test data contains 8.6 million records and was split randomly between the private and public leaderboards, the public leaderboard gave a very confident estimate of the private standing. I was most impressed by the approaches of Markus and of Jack (Japan), who finished in third position. You can read more about their approaches on the forum. Many others also contributed valuable insights.

Private leaderboard score (MAP@3) - two teams stand out from the pack


I started the competition on a modest 8GB laptop but decided to purchase a €1500 workstation two weeks in to speed up the modeling. Starting with limited resources ended up being an advantage since it forced me to think of ways to optimize the feature generation logic. My big friend in this competition was the data.table package.

Running all steps on my 48GB workstation would take about a month. That seems like a ridiculously long time but it is explained by the extended computation time of the nearest neighbor features. While calculating the NN features I was continuously working on other parts of the workflow so speeding the NN logic up would not have resulted in a better final score.

A ~0.62 score could however be achieved in about two weeks by focusing on the most relevant NN features. I would suggest considering 3 of the 7 distance constants (1, 2.5 and 4) and omitting the mid KNN features. Cutting the first level models from 100 to 10 and the second level models from 30 to 5 would also not result in a strong performance decrease (estimated at 0.1%) and would cut the computation time to less than a week. You could of course run the logic on multiple instances to speed things up further.

I really enjoyed working on this competition even though it came during one of the busiest periods of my life: it was launched while I was in the middle of writing my Master's thesis in statistics on top of a full-time job. The data shows many interesting noisy and time dependent patterns, which motivated me to play with it before and after work. It was definitely worth every second of my time! I was inspired by the work of other Kaggle winners and successfully implemented my first two level model. Winning the competition is a nice extra, but it's even better to have learnt a lot from the other competitors; thank you all!

I look forward to your comments and suggestions, please go to my original post to post your comments.


How to innovate in the Age of Big Data presented by Stephen Brobst


Executive Summer Session

Stephen Brobst will be at the European Data Innovation Hub. We asked him to share his views on the importance of open data, open source, analytics in the cloud and data science. Stephen is at the forefront of the technology and we can't wait to hear what is happening in Silicon Valley. Count on leaving the workshop inspired and armed with actionable ideas that will help us define a profitable strategy for data science teams.

Format of the session:

  • 15:00 – Keynote: How to innovate in the Age of Big Data
  • 15:50 – Open discussion on “Sustainable Strategies for Data Science”, tackling the following topics:
  • Data Science is the Key to Business Success
  • Three Critical Technologies Necessary for Big Data Exploitation
  • How to Innovate in the Age of Big Data
  • 16:45 – Networking Session

Stephen Brobst is the Chief Technology Officer of Teradata Corporation. Stephen performed his graduate work in Computer Science at the Massachusetts Institute of Technology, where his Master's and PhD research focused on high-performance parallel processing. He also completed an MBA with joint course and thesis work at the Harvard Business School and the MIT Sloan School of Management. Stephen is a TDWI Fellow and has been on the faculty of The Data Warehousing Institute since 1996. During Barack Obama's first term he was also appointed to the Presidential Council of Advisors on Science and Technology (PCAST) in the working group on Networking and Information Technology Research and Development (NITRD). He was recently ranked by ExecRank as the #4 CTO in the United States (behind the CTOs of Tesla Motors and Intel) out of a pool of 10,000+ CTOs.

Job – Junior Data Scientist


Are you pursuing a career in data science?

We have a great opportunity for you: an intensive training program combined with interesting job opportunities!

Interested? Follow the link to our data science survey and send us your CV.

Once selected, you’ll be invited for the intake event that will take place in Brussels this summer.

Hope to see you there,

Nele & Philippe

Job – Python Predictions – Data Scientists


Hi Philippe,

We’re looking for some great new people again.
Would be great if you could give us some visibility for our search.
Candidates can simply send CV and (e)mail of motivation to
More details in the links or text below
Data Scientist
Python Predictions – Bruxelles Woluwe-Saint-Pierre
Python Predictions is a Brussels-based consulting firm founded in 2006 and specialized in data science and predictive analytics. We are currently looking for data scientists.


  • In-company data science projects for our clients
  • Contribute to explorative, descriptive and predictive analysis

Required skills or education

  • Proven interest and skills in data science and analytics
  • Proven interest and skills in at least one analytical programming language
  • Work flexibly in rapidly changing environments
  • Good visualisation and communication skills
  • Understand business problems


  • Analytical mindset
  • Open minded
  • Integrity
  • Critical of the output produced

Language skills

  • Working knowledge of Dutch, French and English

How to apply?
Send us your curriculum vitae and a brief letter of motivation. We need both documents in order to consider your application.

More details

About us
Why should you apply for a position at Python Predictions? We believe we understand like no one else what makes analysts tick. We believe that successful analysts must possess and develop a number of very distinct skills, ranging from social to technical, from intuitive to analytical. Putting these skills to work on real-life analytical projects is rewarding. And we provide a stimulating environment with a focus on innovation and cooperation. Find out more about our activities on

Job Type: Full-time

Required languages:

  • Dutch
  • English
  • French

Job – MDCPartners – Senior Data Engineer


Job Requirements

About MDCPartners

MDCPartners is a technology company based in Antwerp, Belgium that produces large volumes of healthcare data on a weekly basis. This requires skilled techies with an eye for data, and the ability to apply this knowledge in the field of healthcare. Our clients are top pharma companies that rely on our tools to make crucial decisions during drug development.


As Senior Data Engineer you work in the data lifting team to streamline and innovate data processing components and workflow. You deal with algorithmic, performance and operational tasks related to the main data flow in MDCPartners.

Your main target is to oversee parts of the data generation and to improve the quality of the data, the processing performance, and the downstream application possibilities through your architectural and algorithmic input.

Keywords: data analysis, algorithms, NLP, machine learning, Lucene, ontologies, medical data, performance, parallel programming


  • Have a Master’s degree or PhD in Computer Science
  • Have at least 4 years of proven experience in the field
  • Have fantastic Java skills
  • Know your way around (No)SQL databases
  • Be fluent in English
  • Have a no-nonsense problem solving mindset
  • Be eager to take technical and organizational responsibility
  • Learn quickly, and want to be challenged
  • Have the ability to support and mentor other data engineers

What we offer

  • A hi-tech, creative working environment in a dynamic, growing company
  • Career path to grow to a crucial role
  • Competitive salary & benefits
  • Company car or similar remuneration options possible
  • Health insurance package

How to apply

For further information or to apply for this vacancy, please contact us and include your CV.


Make sure that you are a member of the Brussels Data Science Community LinkedIn group before you apply. Join here.

Please note that we also manage other vacancies that are not public; if you want us to bring you in contact with them too, just send us your CV.


Analytics: Lessons Learned from Winston Churchill


I had the pleasure of being invited to lunch by Prof. Baesens earlier this week, and we talked about ‘War and Analytics’ as a possible subject for a next meetup. As you might know, Bart is a WWI fanatic, and he has already written a nice article on the subject called ‘Analytics: Lessons Learned from Winston Churchill’.

Here is the article:


Analytics has been around for quite some time now. Even during World War II, it proved critical to the Allied victory. Famous examples of Allied analytical activities include the decoding of the Enigma code, which effectively removed the danger of submarine warfare, and the 3D reconstruction of 2D images shot by unarmed Spitfires, which helped Intelligence at RAF Medmenham eliminate the danger of the V1 and V2 and support Operation Overlord. Many of the analytical lessons learned at that time are now more relevant than ever, in particular those provided by one of the great victors of WWII, then Prime Minister Sir Winston Churchill.

The phrase “I only believe in statistics that I doctored myself” is often attributed to him. However, while its wit is certainly typical of the Greatest Briton, it was probably an invention of Nazi propaganda. Even so, can Churchill still teach us something about statistical analysis and Analytics?


A good analytical model should satisfy several requirements, depending on the application area, and follow a certain process. CRISP-DM, a leading methodology for conducting data-driven analysis, proposes a structured approach: understand the business, understand the data, prepare the data, design a model, evaluate it, and deploy the solution. The wisdom of the 1953 Nobel laureate in Literature can help us better understand this process.

Have an actionable approach: aim at solving a real business issue

Any analytics project should start with a business problem, and then provide a solution. Indeed, Analytics is not a purely technical, statistical or computational exercise, since any analytical model needs to be actionable. For example, a model can allow us to predict future problems like credit card fraud or customer churn rate. Because managers are decision-makers, as are politicians, they need “the ability to foretell what is going to happen tomorrow, next week, next month, and next year… And to have the ability afterwards to explain why it didn’t happen.” In other words, even when the model fails to predict what really happened, its ability to explain the process in an intelligible way is still crucial.

In order to be relevant for businesses, the parties concerned need first to define and qualify a problem before analysis can effectively find a solution. For example, trying to predict what will happen in 10 years or more makes little sense from a practical, day-to-day business perspective: “It is a mistake to look too far ahead. Only one link in the chain of destiny can be handled at a time.”  Understandably, many analytical models in use in the industry have prediction horizons spanning no further than 2-3 years.

Understand the data you have at your disposal

There is a fairly large gap between data and comprehension. Churchill went so far as to argue that “true genius resides in the capacity for evaluation of uncertain, hazardous, and conflicting information.”  Indeed, Big Data is complex and is not a quick-fix solution for most business problems. In fact, it takes time to work through and the big picture might even seem less clear at first. It is the role of the Business Analytics expert to really understand the data and know what sources and variables to select.

Prepare the data

Once a complete overview of the available data has been drafted, the analyst will start preparing the tables for modelling by consolidating different sources, selecting the relevant variables and cleaning the data sets. This is usually a very time-consuming and tedious task, but needs to be done: “If you’re going through hell, keep going.”

Never forget to consider as much historical information as you can. Typically, when trying to predict future events, past transactional data is very relevant, as most of the predictive power comes from this type of information. “The longer you can look back, the farther you can look forward.”

read more here