How Tom won his first Kaggle competition

This is a copy of Tom’s original post on Github.

Winning approach of the Facebook V Kaggle competition

The Facebook V: Predicting Check Ins data science competition, where the goal was to predict which place a person would like to check in to, has just ended. I participated with the goal of learning as much as possible and maybe aiming for a top 10% finish, since this was my first serious Kaggle competition attempt. I managed to exceed all expectations and finished 1st out of 1,212 participants! In this post, I’ll explain my approach.

Overview

This blog post will cover all sections to go from the raw data to the winning submission. Here’s an overview of the different sections. If you want to skip ahead, just click the section title to go there.

The R source code is available on GitHub. This thread on the Kaggle forum discusses the solution on a higher level and is a good place to start if you participated in the challenge.

Introduction

Competition banner

From the competition page: The goal of this competition is to predict which place a person would like to check in to. For the purposes of this competition, Facebook created an artificial world consisting of more than 100,000 places located in a 10 km by 10 km square. For a given set of coordinates, your task is to return a ranked list of the most likely places. Data was fabricated to resemble location signals coming from mobile devices, giving you a flavor of what it takes to work with real data complicated by inaccurate and noisy values. Inconsistent and erroneous location data can disrupt experience for services like Facebook Check In.

The training data consists of approximately 29 million observations where the location (x, y), accuracy, and timestamp are given along with the target variable, the check in location. The test data contains 8.6 million observations where the check in location should be predicted based on the location, accuracy and timestamp. The train and test data sets are split based on time. There is no concept of a person in this dataset. All the observations are events, not people.

A ranked list of the top three most likely places is expected for all test records. The leaderboard score is calculated using the MAP@3 criterion. Consequently, ranking the actual place as the most likely candidate gets a score of 1, ranking the actual place as the second most likely gets a score of 1/2 and a third rank of the actual place results in a score of 1/3. If the actual place is not in the top three of ranked places, a score of 0 is awarded for that record. The total score is the mean of the observation scores.
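
To make the metric concrete, here is a minimal R sketch of the MAP@3 calculation described above; the function and variable names are illustrative and not taken from the competition code.

```r
# Minimal MAP@3 sketch: 'actual' holds the true place per observation,
# 'predicted' a matrix with the three ranked guesses per observation.
map_at_3 <- function(actual, predicted) {
  scores <- ifelse(predicted[, 1] == actual, 1,
            ifelse(predicted[, 2] == actual, 1 / 2,
            ifelse(predicted[, 3] == actual, 1 / 3, 0)))
  mean(scores)
}

# Two observations: the first actual place is ranked second, the second is ranked first
actual    <- c("A", "B")
predicted <- matrix(c("C", "A", "D",
                      "B", "E", "F"), nrow = 2, byrow = TRUE)
map_at_3(actual, predicted)  # (1/2 + 1) / 2 = 0.75
```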

Check Ins where each place has a different color

Exploratory analysis

Location analysis of the train check ins revealed interesting patterns in the variation of x and y. There appears to be far more variation in x than in y. It was suggested that this could be related to the streets of the simulated world. The difference in variation between x and y does, however, differ across places, and there is no obvious spatial (x-y) pattern in this relationship.

It was quickly established by the community that time is measured in minutes and could thus be converted to relative hours and days of the week. This means that the train data covers 546 days and the test data spans 153 days. All places seem to live in independent time zones with clear hourly and daily patterns. No spatial pattern was found with respect to the time patterns. There are however two clear dips in the number of check ins during the train period.

Accuracy was by far the hardest input to interpret. It was expected that it would be clearly correlated with the variation in x and y but the pattern is not as obvious. Halfway through the competition I cracked the code and the details will be discussed in the Feature engineering section.

I wrote an interactive Shiny application to research these interactions for a subset of the places. Feel free to explore the data yourself!

Problem definition

The main difficulty of this problem is the large number of classes (places). With 8.6 million test records there are about a trillion (10^12) place-observation combinations. Luckily, most of the classes have a very low conditional probability given the data (x, y, time and accuracy). The major strategy on the forum to reduce the complexity consisted of calculating a classifier for many x-y rectangular grids. It makes sense to use the spatial information since it shows the most obvious and strongest pattern for the different places. This approach makes the complexity manageable but is likely to lose a significant amount of information since the data is so variable. I decided to model the problem with a single binary classification model in order to avoid ending up with many high-variance models. The lack of any major spatial patterns in the exploratory analysis supports this approach.

Strategy

Generating a single classifier for all place-observation combinations would be infeasible even with a powerful cluster. My approach consists of a stepwise strategy in which the conditional place probability is only modeled for a set of place candidates. A simplification of the overall strategy is shown below.

High level strategy

The given raw train data is split into two chronological parts, with a ratio similar to the ratio between the train and test data. The summary period contains all given train observations of the first 408 days (minutes 0-587158). The second part of the given train data contains the next 138 days and will be referred to as the train/validation data from now on. The test data spans 153 days as mentioned before.

The summary period is used to generate train and validation features and the given train data is used to generate the same features for the test data.

The three raw data groups (train, validation and test) are first sampled down into batches that are as large as possible but can still be modeled with the available memory. I ended up using batches of approximately 30,000 observations on a 48GB workstation. The sampling process is fully random and results in train/validation batches that span the entire 138-day train range.
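
As a rough illustration of these two steps, here is a hedged R sketch using data.table (the author’s tool of choice); the toy data, column names and batch logic are assumptions and not taken from the original repository.

```r
library(data.table)

# Toy stand-in for the raw train data; only the 'time' column (in minutes) matters here
set.seed(1)
train <- data.table(time = sample(0:786239, 1e5, replace = TRUE),
                    x = runif(1e5, 0, 10), y = runif(1e5, 0, 10))

# Chronological split at minute 587158, as described above
summary_period <- train[time <= 587158]   # first 408 days
train_valid    <- train[time >  587158]   # remaining 138 days

# Fully random batches of roughly 30,000 observations each
make_batches <- function(dt, batch_size = 30000) {
  dt <- dt[sample(.N)]                      # random shuffle
  dt[, batch := ceiling(.I / batch_size)]   # consecutive ~30k-row chunks
  split(dt, by = "batch")
}

batches <- make_batches(train_valid)
```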

Next, a set of models is built to reduce the number of candidates to 20 using 15 XGBoost models in the second candidate selection step. The conditional probability P(place_match|features) is modeled for all ~30,000*100 place-observation combinations and the mean predicted probability of the 15 models is used to select the top 20 candidates for each observation. These models use features that combine place and observation measures of the summary period.
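
The selection itself boils down to averaging the model probabilities per observation-candidate pair and keeping the 20 highest. A minimal data.table sketch with illustrative column names:

```r
library(data.table)

# preds: one row per (observation, candidate, model) with the predicted probability
rank_candidates <- function(preds, top_n = 20) {
  avg <- preds[, .(mean_prob = mean(prob)), by = .(observation_id, candidate)]
  setorder(avg, observation_id, -mean_prob)
  avg[, head(.SD, top_n), by = observation_id]   # keep the top_n candidates per observation
}

# Tiny example: 2 observations, 3 candidates, 2 models
set.seed(1)
preds <- data.table(observation_id = rep(1:2, each = 6),
                    candidate      = rep(c("A", "B", "C"), times = 4),
                    model_id       = rep(rep(1:2, each = 3), times = 2),
                    prob           = runif(12))
rank_candidates(preds, top_n = 2)
```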

The same features are used to generate the first level learners. Each of the 100 first level learners is again an XGBoost model, built using ~30,000*20 feature-place_match pairs. The predicted probabilities P(place_match|features) are used as features of the second level learners along with 21 manually selected features. The candidates are ordered using the mean predicted probabilities of the 30 second level XGBoost learners.

All models are built using different train batches. Local validation is used to tune the model hyperparameters.

Candidate selection 1

The first candidate selection step reduces the number of potential classes from >100K to 100 by considering nearest neighbors of the observations. I considered the neighbor counts of the 2500 nearest neighbors, where y variations are 2.5 times more important than x variations. Ties in the neighbor counts are resolved by the mean time difference relative to the observations. Resolving ties with the mean time difference is motivated by the shifts in popularity of the places.
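
A simplified R sketch of this weighted nearest neighbor count for a single observation; the tie-breaking by mean time difference is left out and all names are illustrative.

```r
# Count the places among the k nearest summary check ins, with y differences
# weighted y_ratio times more heavily than x differences
nn_candidates <- function(x, y, train_x, train_y, train_place,
                          k = 2500, y_ratio = 2.5) {
  d2 <- (train_x - x)^2 + (y_ratio * (train_y - y))^2    # squared weighted distance
  nn <- order(d2)[seq_len(min(k, length(d2)))]
  sort(table(train_place[nn]), decreasing = TRUE)        # candidate places by neighbor count
}

# Toy usage on simulated summary data
set.seed(1)
tx <- runif(10000, 0, 10); ty <- runif(10000, 0, 10)
tp <- sample(paste0("place_", 1:50), 10000, replace = TRUE)
head(nn_candidates(5, 5, tx, ty, tp, k = 250), 5)
```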

The nearest neighbor counts are calculated efficiently by splitting up the data into overlapping rectangular grids. Grids are created as small as possible while still guaranteeing that the 2500 nearest neighbors fall within the grid in the worst case scenario. The R code is suboptimal through the use of several loops, but the major bottleneck (ordering the distances) was addressed with a custom Rcpp package, which resulted in an approximate 50% speed up. Improving the logic further was not a major priority since the features were calculated in the background.

Feature engineering

Feature engineering strategy

Three weeks into the eight-week competition, I climbed to the top of the public leaderboard with about 50 features. From then on I kept thinking of new features to capture the underlying patterns of the data. I also added features that are similar to the most important features in order to capture the subtler patterns. The final model contains 430 numeric features and this section is intended to discuss the majority of them.

There are two types of features. The first category relates to features that are calculated using only the summary data such as the number of historical check ins. The second and largest category combines summary data of the place candidates with the observation data. One such example is the historical density of a place candidate, one year prior to the observation.

All features are rescaled if needed in order to result in similar interpretations for the train and test features.

Location

The major share of my 430 features is based on nearest neighbor related features. The neighbor counts for different Ks (1, 5, 10, 20, 50, 100, 250, 500, 1000 and 2500) and different x-y ratio constants (1, 2.5, 4, 5.5, 7, 12 and 30) resulted in 10*7 features. For example: if a test observation has 3 of its 5 nearest neighbors of class A and 2 of its 5 nearest neighbors of class B, candidate A will contain the numeric value of 3 for the K=5 feature, candidate B will contain the numeric value of 2 for the K=5 feature, and all other 18 candidates will contain the value of 0 for that feature. The mean time difference between a candidate and all 70 combinations resulted in 70 additional features. 10 more features were added by considering the distance between the Kth nearest neighbor and the observation for a ratio constant of 2.5. These features are an indication of the spatial density. 40 more features were added in a later iteration around the most significant nearest neighbor features: K was set at (35, 75, 100, 175, 375) for x-y ratio constants (0.4, 0.8, 1.3, 2, 3.2, 4.5, 6 and 8). The distances of all 40 combinations to the most distant neighbor were also added as features. Distance features are divided by the number of summary observations in order to have similar interpretations for the train and test features.

I further added several features that consider the (smoothed) spatial grid densities. Other location features relate to the place summaries, such as the median absolute deviations and standard deviations in x and y. The ratio between the median absolute deviations was added as well. Features were relaxed using additive (Laplace) smoothing whenever it made sense, using relaxation constants of 20 and 300. Consequently, the relaxed mad for a place with 300 summary observations is equal to the mean of the place mad and the weighted place population mad for a relaxation constant of 300.
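
In formula form, the relaxation described above is a weighted mean of the place statistic and the overall (population) statistic; a small sketch with illustrative values:

```r
# Additive (Laplace) relaxation of a place statistic towards the population statistic
relax <- function(place_stat, population_stat, n_place, c_relax = 300) {
  (n_place * place_stat + c_relax * population_stat) / (n_place + c_relax)
}

# With 300 summary observations and a relaxation constant of 300, the result is
# exactly the mean of the place mad and the population mad
relax(place_stat = 0.8, population_stat = 0.4, n_place = 300)  # 0.6
```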

Time

The second largest share of the feature set belongs to time features. Here I converted all time period counts to period density counts in order to handle the two drops in the time frequency. Periods include 27 two-week periods prior to the end of the summary data and 27 one-week periods prior to the end of the summary data. I also included features that look at the two-week densities looking back between 1 and 75 weeks from the observations. These features resulted in missing values, but XGBoost is able to handle them. Additional features were added for the clear yearly pattern of some places.

Weekly counts

Hour, day and week features were calculated using the historical densities with and without cyclical smoothing and with or without relaxation. I suspected an interaction between the hour of the day and the day of the week and also added cyclical hour-day features. Features were added for daily 15-minute intervals as well. The cyclical smoothing is applied with Gaussian windows. The windows were chosen such that the smoothed hour, hour-week and 15-minute blocks capture different frequencies.
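
A minimal illustration of cyclical smoothing with a Gaussian window on a 24-bin hourly density; the window width is an assumption rather than the value used in the competition.

```r
# Replace each bin by a Gaussian-weighted average of its circular neighbors
smooth_cyclic <- function(density, sd = 1.5) {
  n <- length(density)
  sapply(seq_len(n), function(i) {
    lag <- pmin(abs(seq_len(n) - i), n - abs(seq_len(n) - i))  # circular distance in bins
    w   <- dnorm(lag, sd = sd)
    sum(w * density) / sum(w)
  })
}

set.seed(1)
hourly   <- runif(24)                 # toy hourly check in densities for one place
smoothed <- smooth_cyclic(hourly)
```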

Other time features include extrapolated weekly densities using various time series models (ARIMA, Holt-Winters and exponential smoothing). Further, the time since the end of the summary period was added, as well as the time between the end of the summary period and the last check in.
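
As an example of such an extrapolation, base R’s Holt-Winters routine can project a weekly density series beyond the summary period; the series below is a placeholder and the seasonal term is dropped since the train period covers fewer than two full years.

```r
# Trend-only (gamma = FALSE) Holt-Winters extrapolation of a weekly density series
set.seed(1)
weekly_density <- ts(100 + cumsum(rnorm(78)))    # placeholder: 78 weeks of summary data
fit <- HoltWinters(weekly_density, gamma = FALSE)
predict(fit, n.ahead = 22)                       # roughly the 153-day test horizon
```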

Accuracy

Understanding accuracy was the result of generating many plots. There is a significant but low correlation between accuracy and the variation in x and y, but it is not until accuracy is binned into approximately equally sized groups that the signal becomes visible. The signal is more accurate for accuracies in the 45-84 range (GPS data?).

Mean variation from the median in x versus 6 time and 32 accuracy groups

The accuracy distribution seems to be a mixed distribution with three peaks which changes over time. It is likely related to three different mobile connection types (GPS, Wi-Fi or cellular). The places show different accuracy patterns and features were added to indicate the relative accuracy group densities. The middle accuracy group was set to the 45-84 range. I added relative place densities for 3 and 32 approximately equally sized accuracy bins. It was also discovered that the location is related to the three accuracy groups for many places. This pattern was captured by adding further features for the different accuracy groups. A natural extension to the nearest neighbor calculation would incorporate the accuracy group, but I no longer had time to implement it.

The x-coordinates seem to be related to the accuracy group for places like 8170103882

Z-scores

Dozens of z-scores were added to indicate how similar a new observation is to the historical patterns of the place candidates. Robust z-scores ((f-median(f))/mad(f) instead of (f-mean(f))/sd(f)) gave the best results.
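
For clarity, the two variants side by side; note that R’s mad() already rescales by 1.4826 so that it is comparable to the standard deviation for normal data.

```r
# Robust z-score versus the classical z-score of a new value against historical values f
robust_z  <- function(f, value) (value - median(f)) / mad(f)
classic_z <- function(f, value) (value - mean(f))   / sd(f)

set.seed(1)
f <- c(rnorm(200), 10)           # historical pattern with one outlier
robust_z(f, 2); classic_z(f, 2)  # the robust version is barely affected by the outlier
```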

Most important features

Nearest neighbors are the most important features for the studied models. The most significant nearest neighbor features appear around K=100 for distance constant ratios around 2.5. Hourly and daily densities were all found to be very important as well and the highest feature ranks are obtained after smoothing. Relative densities of the three accuracy groups also appear near the top of the most important features. An interesting feature that also appears at the top of the list relates to the daily density 52 weeks prior to the check in. There is a clear yearly pattern which is most obvious for places with the highest daily counts.

Clear yearly pattern for place 5872322184. The green line goes back 52 weeks since the highest daily count

The feature files are about 800MB for each batch and I saved all the features to an external HD.

Candidate selection 2

The features from the previous section are used to generate binary classification models on 15 different train batches using XGBoost. With 100 candidates for each observation this is a slow process, and it made sense to me to narrow down the number of candidates to 20 at this stage. I did not perform any downsampling in my final approach since all zeros (no match between the candidate and the true class) contain valuable information. XGBoost is able to handle unbalanced data quite well in my experience. I did, however, consider omitting observations that didn’t contain the true class in the top 100, but this resulted in slightly worse validation scores. The reasoning is the same as above: those values contain valuable information! The 15 candidate selection models are built with the top 142 features. The feature importance order is obtained by considering the XGBoost feature importance ranks of 20 models trained on different batches. Hyperparameters were selected using the local validation batches. The 15 second candidate selection models all generate a predicted probability P(place_match|data); these are averaged to select the top 20 candidates for each observation in the second candidate selection step.

At this point I also dropped observations that belong to places that only have observations in the train/validation period. This filtering was also applied to the test set.

First level learners

The first level learners are very similar to the second candidate selection models other than the fact that they were fit on one fifth of the data for 75 of the 100 models. The other 25 models were fit on 100 candidates for each observation. The 100 base XGBoost learners were fit on different random parts of the training period. Deep trees gave me the best results here (depth 11) and the eta constant was set to (11 or 12)/500 for 500 rounds. Column sampling also helped (0.6) and subsampling the observations (0.5) did not hurt but of course resulted in a fitting speed increase. I included either all 430 features or a uniform random pick of the ordered features by importance in a desirable feature count range (100-285 and 180-240). The first level learner framework was created to handle multiple first level learner types other than XGBoost. I experimented with the nnet and H2O neural network implementations but those were either too slow in transferring the data (H2O) or too biased (nnet). The way XGBoost handles missing values is another great advantage over the mentioned neural network implementations. Also, the XGBoost models are quite variable since they are trained on different random train batches with differing hyperparameters (eta constant, number of features and the number of considered candidates (either 20 or 100)).
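
A hedged sketch of one such first level learner with the xgboost R package, using the hyperparameters quoted above; the toy feature matrix and labels are placeholders for the real candidate features.

```r
library(xgboost)

# Placeholder features and binary place_match labels (the real batches hold
# ~30,000 observations with 20 or 100 candidates each)
set.seed(14)
train_features <- matrix(rnorm(2000 * 10), nrow = 2000)
place_match    <- rbinom(2000, 1, 0.05)               # unbalanced target, as in the post

dtrain <- xgb.DMatrix(data = train_features, label = place_match)
params <- list(objective        = "binary:logistic",
               eta              = 11 / 500,
               max_depth        = 11,
               colsample_bytree = 0.6,
               subsample        = 0.5)

model <- xgb.train(params = params, data = dtrain, nrounds = 500)
pred  <- predict(model, dtrain)                       # P(place_match | features)
```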

Second level learners

The 30 second level learners combine the predictions of the 100 first level models along with 21 manually selected features for all top 20 candidates. The 21 additional features are high level features such as the x, y and accuracy values as well as the time since the end of the summary period. The added value of the 21 features is very low but constant on the validation set and the public leaderboard (~0.02%). The best local validation score was obtained by considering moderate tree depths (depth 7) and the eta constant was set to 8/200 for 200 rounds. Column sampling also helped (0.6) and subsampling the observations (0.5) did not hurt but again resulted in a fitting speed increase. The candidates are ordered using the mean predicted probabilities of the 30 second level XGBoost learners.

Analysis of the local MAP@3 indicated better results for accuracies in the 45-84 range. The difference between local and test validation scores is in large part related to this observation. There seems to be a trend towards the use of devices that show less variation.

Local MAP@3 versus accuracy groups

Conclusion

The private leaderboard standing below, used to rank the teams, shows the top 30 teams. It was a very close competition in the end and Markus would have been a well-deserved winner as well. We were very close to each other ever since the third week of the eight-week contest and pushed each other forward. The fact that the test data contains 8.6 million records and that it was split randomly for the private and public leaderboard resulted in a very confident estimate of the private standing given the public leaderboard. I was most impressed by the approaches of Markus and Jack (Japan) who finished in third position. You can read more about their approaches on the forum. Many others also contributed valuable insights.

Private leaderboard score (MAP@3) – two teams stand out from the pack

I started the competition using a modest 8GB laptop but decided to purchase a €1500 workstation two weeks into the competition to speed up the modeling. Starting with limited resources ended up being an advantage since it forced me to think of ways to optimize the feature generation logic. My big friend in this competition was the data.table package.

Running all steps on my 48GB workstation would take about a month. That seems like a ridiculously long time but it is explained by the extended computation time of the nearest neighbor features. While calculating the NN features I was continuously working on other parts of the workflow so speeding the NN logic up would not have resulted in a better final score.

Generating a ~.62 score could however be achieved in about two weeks by focusing on the most relevant NN features. I would suggest considering 3 of the 7 distance constants (1, 2.5 and 4) and omitting the mid KNN features. Cutting the first level models from 100 to 10 and the second level models from 30 to 5 would also not result in a strong performance decrease (estimated decrease of 0.1%) and would cut the computation time to less than a week. You could of course run the logic on multiple instances and further speed things up.

I really enjoyed working on this competition even though it was already one of the busiest periods of my life. The competition was launched while I was in the middle of writing my Master’s Thesis in statistics in combination with a full time job. The data shows many interesting noisy and time dependent patterns which motivated me to play with the data before and after work. It was definitely worth every second of my time! I was inspired by the work of other Kaggle winners and successfully implemented my first two level model. Winning the competition is a nice extra but it’s even better to have learnt a lot from the other competitors, thank you all!

I look forward to your comments and suggestions, please go to my original post to post your comments.

Tom.

How to innovate in the Age of Big Data presented by Stephen Brobst

Executive Summer Session

Stephen Brobst will be at the European Data Innovation Hub. We asked him to share his views on the importance of open data, open source, analytics in the cloud and data science. Stephen is at the forefront of technology and we can’t wait to hear what is happening in Silicon Valley. You can count on leaving the workshop inspired and armed with actionable ideas that will help us define a profitable strategy for our data science teams.

Format of the session :

  • 15:00 – Keynote: How to innovate in the Age of Big Data
  • 15:50 – Open Discussion on “Sustainable Strategies for Data Science”, tackling the following topics:
  • Data Science is the Key to Business Success
  • Three Critical Technologies Necessary for Big Data Exploitation
  • How to Innovate in the Age of Big Data
  • 16:45 – Networking Session

Stephen Brobst is the Chief Technology Officer for Teradata Corporation.  Stephen performed his graduate work in Computer Science at the Massachusetts Institute of Technology where his Masters and PhD research focused on high-performance parallel processing. He also completed an MBA with joint course and thesis work at the Harvard Business School and the MIT Sloan School of Management.  Stephen is a TDWI Fellow and has been on the faculty of The Data Warehousing Institute since 1996.  During Barack Obama’s first term he was also appointed to the Presidential Council of Advisors on Science and Technology (PCAST) in the working group on Networking and Information Technology Research and Development (NITRD).  He was recently ranked by ExecRank as the #4 CTO in the United States (behind the CTOs from Amazon.com, Tesla Motors, and Intel) out of a pool of 10,000+ CTOs.

Job – Click@Bike – Senior Data Engineer

Click@Bike  is a promising start-up with European ambitions for the development and distribution of an innovative touristic cycling product-service offer for the hospitality sector. The Company is well capitalised and has extensive international support from the public sector.

To reinforce its in-house IT development team, the Company is looking to hire a Senior Data Engineer.

About the role: Data Engineer

You will be a Senior Data Engineer responsible for all operational aspects of the Company’s data: from sourcing through public and commercial tourist channels, over designing a robust schema and data model, to implementing and maintaining the data infrastructure using the latest technologies and best practices, with the aim of providing the most current data in a secure, efficient, reliable and scalable manner to support back-end services and front-end user information display services.

The Senior Data Engineer will work together with external data providers. She/he will perform the technical analysis of the specific data interfaces, execute the data translation and integration with the in-house back-end systems by implementing or developing respective Extract, Transform, Load (ETL) solutions, and, together with the Product Manager, roll-out prototypes and final products to customers.

Further to the primary tasks of the Senior Data Engineer, it is the Company’s strategy to broaden the scope towards software engineering tasks in the backend application stack over time to cross-functionalize the IT department.

The Senior Data Engineer reports to the Product Manager; as the first in-house data engineer, she/he has the unique chance to develop the Company’s data engineering processes from the ground up and to manage the Company’s data.

The Senior Data Engineer will establish and lead the data engineering team.

Essential Qualifications

  • Master in IT with 7 years of job related experience
  • Experience as a software engineer, data engineer, data architect or any combination of the roles
  • Software programming skill set and sound knowledge of design patterns
  • Experience in object oriented programming, preferably using Java
  • Deep understanding of database schema design and data structures
  • Experience in data modelling using inter alia entity relationship diagrams, UML
  • Experience with RDBMS: MySQL, PostgreSQL, MS SQL or Oracle
  • Experience with ETL tools, preferably open source, such as Talend, Pentaho
  • Experience with RDBMS spatial extensions
  • Experience with structured data communication: SOAP, REST/XML, JSON
  • Excellent technical communication skills
  • Software modelling, architecture & software services design experience
    • Unified Modelling Language (UML)
    • User stories, use cases
    • System design
    • Functional and technical systems specifications
  • Experience with Back-end and front-end application development
  • Experience in IT project management
  • Knowledge of NoSQL database concepts and their typical use cases
  • Experience in team leadership

Desirable Qualifications

  • Experience in working in an international environment
  • Experience with NoSQL databases such as graph DB (e.g. Neo4j), document DB (e.g. MongoDB) or key-value stores (e.g. Voldemort or CouchDB)
  • Experience with Java Enterprise: J2EE, JSP/JSF, EJB, JDBC, JMS framework
  • Experience with Android development
  • Experience in web development and system setup: Apache/Tomcat, PHP

Office & Development Tools

  • MS Office
    • MS Word, MS Powerpoint: advanced user
    • MS Access, MS Excel: power user
  • Eclipse/Netbeans, Bitbucket/Git/GitHub, Confluence&JIRA,
  • XML Editor, such as XML Spy or Oxygen XML Editor
  • Enterprise Architect
  • PowerDesigner
  • Operating systems: Windows 8.1 or 10, Linux (Ubuntu, Debian, RedHat)

Soft Skills

  • Spirit to work in a start-up environment
  • Good communication skills
  • Good teamwork competencies
  • Sound analytical skills
  • Sound judgement
  • Service and customer oriented

Language Skills

  • We are operating in an international environment; hence a high English proficiency is mandatory.

Additionally: fluency in Dutch, French or German; a third language is a plus.

Apply:

Make sure that you are a member of the Brussels Data Science Community linkedin group before you apply. Join  here.

Please note that we also manage other vacancies that are not public, if you want us to bring you in contact with them too, just send your CV to datasciencebe@gmail.com .

For further information or to apply for this vacancy, please contact Bart Vandermeeren and include your CV.

Analytics: Lessons Learned from Winston Churchill

I had the pleasure of being invited for lunch by Prof. Baesens earlier this week, and we talked about a possible next meetup subject: ‘War and Analytics’. As you might know, Bart is a WWI fanatic and he has already written a nice article on the subject called ‘Analytics: Lessons Learned from Winston Churchill’.

Here is the article:

Nicolas Glady’s Activities

Analytics has been around for quite some time now. Even during World War II, it proved critical for the Allied victory. Some famous examples of Allied analytical activities include the decoding of the Enigma code, which effectively removed the danger of submarine warfare, and the 3D reconstruction of 2D images shot by gunless Spitfires, which helped Intelligence at RAF Medmenham eliminate the danger of the V1 and V2 and support Operation Overlord. Many of the analytical lessons learned at that time are now more relevant than ever, in particular those provided by one of the great victors of WWII, then Prime Minister, Sir Winston Churchill.

The phrase “I only believe in statistics that I doctored myself” is often attributed to him. However, while its wit is certainly typical of the Greatest Briton, it was probably a Nazi Propaganda invention. Even so, can Churchill still teach us something about statistical analyses and Analytics?

 

A good analytical model should satisfy several requirements depending upon the application area and follow a certain process. CRISP-DM, a leading methodology for conducting data-driven analysis, proposes a structured approach: understand the business, understand the data, prepare the data, design a model, evaluate it, and deploy the solution. The wisdom of the 1953 Nobel laureate in Literature can help us better understand this process.

Have an actionable approach: aim at solving a real business issue

Any analytics project should start with a business problem, and then provide a solution. Indeed, Analytics is not a purely technical, statistical or computational exercise, since any analytical model needs to be actionable. For example, a model can allow us to predict future problems like credit card fraud or customer churn rate. Because managers are decision-makers, as are politicians, they need “the ability to foretell what is going to happen tomorrow, next week, next month, and next year… And to have the ability afterwards to explain why it didn’t happen.” In other words, even when the model fails to predict what really happened, its ability to explain the process in an intelligible way is still crucial.

In order to be relevant for businesses, the parties concerned need first to define and qualify a problem before analysis can effectively find a solution. For example, trying to predict what will happen in 10 years or more makes little sense from a practical, day-to-day business perspective: “It is a mistake to look too far ahead. Only one link in the chain of destiny can be handled at a time.”  Understandably, many analytical models in use in the industry have prediction horizons spanning no further than 2-3 years.

Understand the data you have at your disposal

There is a fairly large gap between data and comprehension. Churchill went so far as to argue that “true genius resides in the capacity for evaluation of uncertain, hazardous, and conflicting information.”  Indeed, Big Data is complex and is not a quick-fix solution for most business problems. In fact, it takes time to work through and the big picture might even seem less clear at first. It is the role of the Business Analytics expert to really understand the data and know what sources and variables to select.

Prepare the data

Once a complete overview of the available data has been drafted, the analyst will start preparing the tables for modelling by consolidating different sources, selecting the relevant variables and cleaning the data sets. This is usually a very time-consuming and tedious task, but needs to be done: “If you’re going through hell, keep going.”

Never forget to consider as much historical information as you can. Typically, when trying to predict future events, using past transactional data is very relevant as most of the predictive power comes from this type of information. “The longer you can look back, the farther you can look forward.”

read more here

Launching the first Data Science Bootcamp in Europe

We are so happy to launch the first European data science bootcamp.

It is so nice to write this page on the launch of the first European data science bootcamp that will start this summer in Brussels. This initiative will boost the digital transformation effort of each company by allowing them to improve their data skills either by recruiting trainees and young graduates or transforming existing BI teams to become experienced business data scientists.

An intense 5+12 week approach focused on practical, directly applicable business cases.

The content of this bootcamp originated from the Data Science Community. Following the advice of our academic,  innovation and training partners we have decided to offer a unique hands-on 5 + 12 weeks approach.

  1. We call the first 5 weeks the Summer Camp (starts Aug 16th). The participants work onsite or remotely on e-learning MOOCs from DataCamp to demonstrate their ability to code in Python, R, SAS, SQL and to master statistical principles. During this period experts put all their energy into coaching the candidates in keeping up the pace and finishing the exercises. All the activities take place in our training centre located in the European Data Innovation Hub.
    -> If you are a young graduate you can expect to be contacted by tier one companies who will offer you a job or traineeship that starts with participation in the data science bootcamp.
  2. The European Data Science Bootcamp starts September 19th. During a 12-week period – every Monday and Tuesday – participants will work on 15 different business cases presented by business experts from different industries and covering diverse business areas. Each Friday, the future data scientists will gather to work on their own business case, with coaching by our data experts to achieve an MVP (Minimum Viable Product) at the conclusion of the bootcamp.

Delivering strong experienced business data science professionals after an intense semester of hands-on business cases.

Companies are invited to reserve seats for their own existing staff or for the young graduates who have expressed interest in following the bootcamp.

Please reserve your seat(s) now, as this bootcamp is limited to 15 participants.

Please contact Nele Coghe at training@di-academy.com or click on di-Academy to learn more about this first European Data Science Bootcamp.

  • Here  is the powerpoint presentation explaining the Bootcamp.
  • Here is the presentation done by Nele during the Data Innovation Summit.

Hope to see you soon at the Hub,

Philippe Van Impe
pvanimpe@di-academy.com

Nurturing your Data Scientists : 70 years of accumulated experience at your service!

The Data Science community is proud to announce the arrival of a startup dedicated to coaching and nurturing data scientists: WeLoveDataScience.

Data scientist… Where are you? Do you really exist?

That’s the question many managers currently face. Data scientists are scarce five-legged sheep: difficult to find, difficult to hire and difficult to keep.

 

 

WeLoveDataScience is a brand new business unit, hosted at the European Data Innovation Hub, dedicated to searching, selecting, training, coaching and nurturing data science talent. The Belgian market does not offer enough candidates: we will take care of training next-gen data scientists for you.

Whatever your projects are, we propose to prepare for you the data scientist(s) you need, following these 7 steps:

  1. Together we prepare a detailed job description corresponding to your real needs: is this about data analysis, basic queries, reporting, data mining, big data or new technologies?
  2. We identify candidates on the market in particular through close collaborations with Belgian universities (Ghent, ULB/VUB, UCLouvain…)
  3. You hire the right candidate!
  4. He/she attends a 12-week data science bootcamp, including: a high-level overview, data science for business, the technical stack (SAS, R, Python…), and introductions to specific technologies/topics (NoSQL or graph databases, social network analysis, text-mining…)
  5. Then he/she works for you at our offices for 4 to 6 months, coached by one of our experts. On-the-job coaching on real projects: your projects, but also hackathons, technology workshops, meetups…
  6. After those 10 intensive months, (s)he is ready to work for you on site. (S)He will demonstrate his/her knowledge by giving a course on a specific topic and/or writing entries in specialised blogs, giving a presentation at a conference…
  7. We assist you in yearly evaluation and follow-up.

Our intentions for 2016 are to help companies create and develop data science teams and to build a data science culture… And WeLoveDataScience: this is 70 years of accumulated experience at your disposal!

Want to know more? Visit www.welovedatascience.com, send an email to info@welovedatascience.com or simply fill in this contact form. We will visit you, explain what we do and brainstorm about your specific needs.

 

Job – Predicube – Senior Data Scientist

Senior Data Scientist

Experience: 3+ years of experience
Employment Type: Full-Time
Description:
We are looking for a senior data scientist to expand our growing team. You will be working on your dedicated project related to cross-channel advertising. We offer a dynamic job on the 19th floor of the KBC tower in Antwerp, with the possibility of being included in the employees’ stock option program.

If you have good data science skills, a passion for new challenges and knowledge of big data technology, you are our man or woman!
Desired Skills and Expertise:
Expertise

  • Data science and predictive modeling
  • Computer Science, Physics or Mathematics background

Technology

  • Python, Linux
  • Knowledge about Amazon Web Services (Elastic MapReduce, Spark, etc.) is a definite plus.
  • Ethical reflex concerning privacy-friendly analytics

Apply:

Make sure that you are a member of the Brussels Data Science Community linkedin group before you apply. Join  here.

Please note that we also manage other vacancies that are not public, if you want us to bring you in contact with them too, just send your CV to datasciencebe@gmail.com .

Interested? The original job posting is available at this link: http://www.predicube.com/solutions.html#section6

More Jobs ?

Click here for more Data related job offers.
Join our community on linkedin and attend our meetups.
Follow our twitter account: @datajobsbe

Improve your skills:

Why don’t you join one of our #datascience trainings to sharpen your skills?

Special rates apply if you are a job seeker.

Check out the full agenda here.

Video channel and e-learning:

Follow the link to subscribe to our video channel.

Join the experts at our Meetups:

Each month we organize a Meetup in Brussels focused on a specific DataScience topic.


The ABC of Datascience blogs – collaborative update

A – ACID – Atomicity, Consistency, Isolation and Durability

B – Big Data – Volume, Velocity, Variety

C – Columnar (or Column-Oriented) Database

  • CoolData By Kevin MacDonell on Analytics, predictive modeling and related cool data stuff for fund-raising in higher education.
  • Cloud of data blog By Paul Miller, aims to help clients understand the implications of taking data and more to the Cloud.
  • Calculated Risk, Finance and Economics

D – Data Warehousing – Relevant and very useful

E – ETL – Extract, transform and load

F – Flume – A framework for populating Hadoop with data

  • Facebook Data Science Blog, the official blog of interesting insights presented by Facebook data scientists.
  • FiveThirtyEight, by Nate Silver and his team, gives a statistical view of everything from politics to science to sports with the help of graphs and pie charts.
  • Freakonometrics, by Charpentier, a professor of mathematics, offers a nice mix of generally accessible and more challenging posts on statistics-related subjects, all with a good sense of humor.
  • Freakonomics blog, by Steven Levitt and Stephen J. Dubner.
  • FastML, covering practical applications of machine learning and data science.
  • FlowingData, the visualization and statistics site of Nathan Yau.

G – Geospatial Analysis – A picture worth 1,000 words or more

H – Hadoop, HDFS, HBASE

  • Harvard Data Science, thoughts on Statistical Computing and Visualization.
  • Hyndsight by Rob Hyndman, on fore­cast­ing, data visu­al­iza­tion and func­tional data.

I – In-Memory Database – A new definition of superfast access

  • IBM Big Data Hub Blogs, blogs from IBM thought leaders.
  • Insight Data Science Blog on latest trends and topics in data science by Alumnus of Insight Data Science Fellows Program.
  • Information is Beautiful, by independent data journalist and information designer David McCandless, who is also the author of the book ‘Information is Beautiful’.
  • Information Aesthetics designed and maintained by Andrew Vande Moere, an Associate Professor at KU Leuven university, Belgium. It explores the symbiotic relationship between creative design and the field of information visualization.
  • Inductio ex Machina by Mark Reid’s research blog on machine learning & statistics.

J – Java – Hadoop gave it a nice push

  • Jonathan Manton’s blog by Jonathan Manton, Tutorial-style articles in the general areas of mathematics, electrical engineering and neuroscience.
  • JT on EDM, James Taylor on Everything Decision Management
  • Justin Domke blog, on machine learning and computer vision, particularly probabilistic graphical models.
  • Juice Analytics on analytics and visualization.

K – Kafka – High-throughput, distributed messaging system originally developed at LinkedIn

L – Latency – Low Latency and High Latency

  • Love Stats Blog By Annie, a market research methodologist who blogs about sampling, surveys, statistics, charts, and more
  • Learning Lover on programming, algorithms with some flashcards for learning.
  • Large Scale ML & other Animals, by Danny Bickson, who started GraphLab, an award-winning large scale open source project.

M – Map/Reduce – MapReduce

N – NoSQL Databases – No SQL Database or Not Only SQL

O – Oozie – Open-source workflow engine managing Hadoop job processing

  • Occam’s Razor by Avinash Kaushik, examining web analytics and Digital Marketing.
  • OpenGardens, Data Science for Internet of Things (IoT), by Ajit Jaokar.
  • O’reilly Radar O’Reilly Radar, a wide range of research topics and books.
  • Oracle Data Mining Blog, Everything about Oracle Data Mining – News, Technical Information, Opinions, Tips & Tricks. All in One Place.
  • Observational Epidemiology, where a college professor and a statistical consultant offer their comments, observations and thoughts on applied statistics, higher education and epidemiology.
  • Overcoming Bias, by Robin Hanson and Eliezer Yudkowsky, presents statistical analysis in reflections on honesty, signaling, disagreement, forecasting and the far future.

P – Pig – Platform for analyzing huge data sets

  • Probability & Statistics Blog By Matt Asher, statistics grad student at the University of Toronto. Check out Asher’s Statistics Manifesto.
  • Perpetual Enigma by Prateek Joshi, a computer vision enthusiast writes question-style compelling story reads on machine learning.
  • PracticalLearning by Diego Marinho de Oliveira on Machine Learning, Data Science and Big Data.
  • Predictive Analytics World blog, by Eric Siegel, founder of Predictive Analytics World and Text Analytics World, and Executive Editor of the Predictive Analytics Times, makes the how and why of predictive analytics understandable and captivating.

Q – Quantitative Data Analysis

R – Relational Database – Still relevant and will be for some time

  • R-bloggers , best blogs from the rich community of R, with code, examples, and visualizations
  • R chart A blog about the R language written by a web application/database developer.
  • R Statistics By Tal Galili, a PhD student in Statistics at the Tel Aviv University who also works as a teaching assistant for several statistics courses in the university.
  • Revolution Analytics hosted, and maintained by Revolution Analytics.
  • Rick Sherman: The Data Doghouse on business and technology of performance management, business intelligence and datawarehousing.
  • Random Ponderings by Yisong Yue, on artificial intelligence, machine learning & statistics.

S – Sharding (Database Partitioning)  and Sqoop (SQL Database to Hadoop)

  • Salford Systems Data Mining and Predictive Analytics Blog, by Dan Steinberg.
  • Sabermetric Research, by Phil Burnbaum, who blogs about statistics in baseball, the stock market, sports predictors and a variety of subjects.
  • Statisfaction, a blog jointly written by PhD students and post-docs from Paris (Université Paris-Dauphine, CREST). Mainly tips and tricks useful in everyday jobs, links to various interesting pages, articles, seminars, etc.
  • Statistically Funny. True to its name, epidemiologist Hilda Bastian’s blog is a hilarious account of the science of unbiased health research with the added bonus of cartoons.
  • SAS Analysis, a weekly technical blog about data analysis in SAS.
  • SAS blog on text mining on text mining, voice mining and unstructured data by SAS experts.
  • SAS Programming for Data Mining Applications, by LX, Senior Statistician in Hartford, CT.
  • Shape of Data, presents an intuitive introduction to data analysis algorithms from the perspective of geometry, by Jesse Johnson.
  • Simply Statistics By three biostatistics professors (Jeff Leek, Roger Peng, and Rafa Irizarry) who are fired up about the new era where data are abundant and statisticians are scientists.
  • Smart Data Collective, an aggregation of blogs from many interesting data science people
  • Statistical Modeling, Causal Inference, and Social Science by Andrew Gelman
  • Stats with Cats, by Charlie Kufs, who has been crunching numbers for over thirty years, first as a hydrogeologist and, since the 1990s, as a statistician. His tagline: when you can’t solve life’s problems with statistics alone.
  • StatsBlog, a blog aggregator focused on statistics-related content, and syndicates posts from contributing blogs via RSS feeds.
  • Steve Miller BI blog, at Information management.

T – Text Analysis – Larger the information, more needed analysis

U – Unstructured Data – Growing faster than speed of thoughts

V – Visualization – Important to keep the information relevant

  • Vincent Granville blog. Vincent, the founder of AnalyticBridge and Data Science Central, regularly posts interesting topics on Data Science and Data Mining

W – Whirr – Big Data Cloud Services i.e. Hadoop distributions by cloud vendors

X – XML – Still eXtensible and no Introduction needed

  • Xi’an’s Og, a blog written by a professor of Statistics at Université Paris-Dauphine, mainly centred on computational and Bayesian topics.

Y – Yottabyte – Equal to 1,000 zettabytes, 1 million exabytes, 1 billion petabytes or 1 trillion terabytes

Z – Zookeeper – Help managing Hadoop nodes across a distributed network

Feel free to add your preferred blog in the comments below.

Other resources:

Nice video channels:

More Jobs ?

Click here for more Data related job offers.
Join our community on linkedin and attend our meetups.
Follow our twitter account: @datajobsbe

Improve your skills:

Why don’t you join one of our #datascience trainings to sharpen your skills?

Special rates apply if you are a job seeker.

Here are some training highlights for the coming months:

Check out the full agenda here.

Join the experts at our Meetups:

Each month we organize a Meetup in Brussels focused on a specific DataScience topic.


Job – KBC – Junior data scientist

What will your responsibilities be? 

In a changing market environment, where customer needs are more diverse and customer expectations are more personalized, KBC group wants to optimally use the growing “data footprint” of the market to become more customer centric and become a reference in data analytics. Ultimately, we aim to transform our business into a data driven group. KBC therefore wants to attract capabilities that are highly advanced in exploiting, analysing and modelling data.

What do we expect from you? 

  • Feeding local Business Units with unknown insights based on data and assisting them with the commercial activation of these insights
  • Analytical work on a portfolio of initiatives
  • Testing commercial hypotheses as suggested by Business Units or from within the own group
  • Becoming the reference for Big Data expertise within the Group with regard to Data Analysis & Modelling

Location 

Louvain, with regular trips to other KBC headquarters internationally.

What are your key strengths? 

  • A passion for working with data
  • Entrepreneurial spirit
  • Open-minded; willing to learn and look for alternative solutions
  • Dynamic and hard-working

 

Your degree:

You have finished your Master’s degree or PhD in Mathematics, Applied Science, Computer Science, Statistics, Physics, Econometrics, Actuarial Science, or a comparable field.

Your experience

0-2 years’ experience in the field:

  • Our ideal candidate is passionate about working with data, professionally or in leisure time.
  • SQL, OLAP, JSON, XML have no more secrets for you. 
  • You are able to manage and analyse large data sets with analytic rigor by using statistical methods such as Exploratory data analysis, Bayesian statistics, Probability Theory, Regression, Correlation, Monte Carlo, hypothesis testing. 
  • You are familiar with programs like Python/R/SPSS, Web scraping, Java, .NET, C#. 
  • Previous experience or proven interest in machine learning, data/text mining, data ingestion, supervised/unsupervised learning predictive algorithms, classifiers, association, regression, Trees, KNN are a plus.
  • Ideally, you will possess a wide range of technical skills and be highly proficient in turning data discoveries into insights for the business.
  • Knowledge in the field of financial services, marketing or client behavior is an advantage (banking industry, telecom)

 

Your profile

  • A great drive: you challenge your colleagues and yourself.
  • Constant learning and adapting to a fast-paced environment, where persistence, innovation, natural curiosity, pragmatism and creativity in problem solving are the key words.
  • You have excellent communication skills and can clearly present and visualize your work to others. You can clearly communicate the outcomes of complicated analyses, making complex things simple without dropping the essence of the problem.
  • This unique combination makes you a good team player, but you are also able to work independently on different cases.

What can we offer you? 

You can count on KBC for:

  •  active support during your career,
  • an exceptional range of training and development opportunities,
  • many different career opportunities,
  • a permanent contract,
  • a competitive salary package, including an extensive package of additional benefits and special terms for employees for our banking and insurance products,
  • possibilities to integrate your work and private life,
  • a dynamic working environment with an open culture and pleasant atmosphere.

 

Apply:

Make sure that you are a member of the Brussels Data Science Community linkedin group before you apply. Join  here.

Please note that we also manage other vacancies that are not public, if you want us to bring you in contact with them too, just send your CV to datasciencebe@gmail.com .

Send your job application today! 

Apply by following this link.

The question “Are all Data Scientists nerds?” answered thanks to the Data Innovation Survey 2015

This article was originally published here

Although the Data Scientist has been declared the sexiest job of the 21st century by HBR and others, if we are honest, we need to admit that data scientists are still seen as nerds by the mainstream population. This data innovation survey was the perfect opportunity for me to investigate whether data scientists are really as nerdy as many perceive them to be.

I started this article by looking up some background information on nerds (after all, I do consider myself a data scientist). I found a very appropriate description on Wikipedia:

Nerd (adjective: nerdy) is a descriptive term, often used pejoratively, indicating that a person is overly intellectual, obsessive, or socially impaired. They may spend inordinate amounts of time on unpopular, obscure, or non-mainstream activities, which are generally either highly technical or relating to topics of fiction or fantasy, to the exclusion of more mainstream activities. Additionally, many nerds are described as being shy, quirky, and unattractive, and may have difficulty participating in, or even following, sports. Stereotypical nerds are commonly seen as intelligent but socially and physically awkward. Some interests and activities that are likely to be described as nerdy are: intellectual, academic, or technical hobbies, activities, and pursuits, especially topics related to science, mathematics, engineering and technology.

Does any of this sound familiar to you?

Let’s dive into the results of the data innovation survey, together with my best friend SAS Visual Analytics, to check if these stereotypes are true in the Belgian Data Science Landscape.

Stereotype n°1: All data scientists are young males

It probably doesn’t come as a surprise to you that 87.2% of the respondents are male, but I’m glad to see that 36 other women took the survey along with me. In terms of age, we do find a lot of youngsters, but the categories above 35 seem to be well represented too.

Note to the designer of the survey: next time, please don’t impose fixed age categories but let people enter their real age if you want to see more interesting graphs than poor pie charts…

 Stereotype n°2: Data scientists are in front of their computer all night

Participants had nine days to respond to the survey. In the bar chart below you can see on which days the 289 respondents submitted the survey. We observe a clear pattern in the beginning of both weeks and strangely enough a drop towards Friday 13th… Maybe data scientists are more superstitious than they would like to admit?

Even more interesting to analyze are the times of the day when people took the survey. To my big surprise there’s a peak in the morning, so the Belgian data scientists seem to be early birds!

As we received the start time and the end time, I also calculated how long the average data scientist took to complete the questionnaire: 12.66 minutes, but the median data scientist had the job done in 10 minutes. We all remember our first statistics class: when the median is not equal to the mean, the distribution is not symmetric…

Stereotype n°3: Data scientists are disconnected from the real world

If all data scientists are actually nerds, then they should all be quite “unworldly”. According to the Belgian Data Science survey, almost one third work for a business organization or NGO with, on average, 7,777 employees worldwide, which doesn’t sound so nerdy to me…

In total, 42% of the Belgian data scientists who took the survey are employed in the IT and technology industry. Ok, what else did you expect?

If data scientists were really as socially inadequate as some would have you believe, they would never make it to a management position in their organization. And look: almost 55% of our respondents have management responsibilities to a certain extent.

Stereotype n°4: All Data scientists hold a PhD in science or mathematics

Wrong again! Only 18.3% of the Belgian data scientists hold a PhD degree. Although the majority graduated in science & math, ICT or engineering, a significant number completed commerce or social studies.

Stereotype n°5: All Data scientists are programming geeks and only use non-mainstream techniques

In part 6 of the survey, participants were asked to rate their skills with a score between 1 (don’t know this technique) and 5 (I’m a guru). It turns out that data scientists are not all gurus in the newer techniques like big data and machine learning, but are mostly familiar with traditional techniques like data manipulation (regexes, Python, R, SAS, web scraping) and structured data (RDBMS, SQL, JSON, XML, ETL).

Although we observe some quite high correlations (between math & optimization 0.73, big data & unstructured data 0.67, …), it doesn’t necessarily mean that the scores are high on these topics. This is clearly illustrated by the heat maps below. On the left we have math and optimization, which are highly correlated but with low scores, and on the right there is data manipulation and structured data, with a moderate correlation of 0.42 but with the highest scores.

Stereotype n°6: All Data scientists are socially isolated and afraid to appear in public

The Belgian Data Scientists don’t only attend the monthly meetup meetings to learn about the new developments in Data Science or to hear what’s happening on the Belgian Data Science scene, but many of them also state social and networking reasons as motivation to get away from their pc to attend these meetings.

Stereotype n°7: There are clear role models for data scientists, they all look up to the same persons

Not that many respondents seem to be influenced by other data scientists around the world, as only a few answered this question with the name of a fellow data scientist, and mostly different ones at that. For Belgium, on the other hand, we find two names that each appeared eight times among the answers. Congratulations to Bart Baesens and Philippe Van Impe, the Belgian data science gurus!

Conclusion

The conclusion of the analysis of the Data Innovation Survey is as straightforward as it is simple: Data Scientist is the sexiest job of the 21st century! Unfortunately I’ll have to finish off here as my pole dancing class is about to start…
