Boot Camp Graduation

On Tuesday, December 20th, we celebrated the completion of the di-Academy’s first ever data science boot camp.

With the holidays approaching, not all of our boot-campers could make it to the event to celebrate with us, but all of the boot-campers in attendance gave five-minute presentations on the personal projects they had been working on during the boot camp.


All graduates were given an academic cap and an “I survived the data science bootcamp” t-shirt by the wonderful dean of the di-Academy, Nele Coghe.

After the presentations, we, of course, celebrated with pizza and beer.

And had some fun with virtual reality.

Boot Camp Graduate: Agustina Perez Iriarte

Allow me to introduce Agustina Perez Iriarte, one of our seventeen boot camp graduates from the di-Academy. I met Agustina at our summer coding camp, and I sat down with her to discuss her time before the boot camp, her time during it, and her plans afterward. Agustina is from Buenos Aires, Argentina, and arrived in Brussels last February. Though she had studied literature, her first job out of college required that she learn to code. “I really had to do a lot of coding at the job, and I realized that I was good and I was enjoying it,” she said, “so maybe this is the kind of job I’d like to do.” She began some training on web programming and web design, receiving two certifications from the Universidad Tecnológica de Buenos Aires.

In 2010, Agustina began working for American Express in Argentina. Her boss was interested in doing more work with data, and she volunteered for the job. Though not as a data scientist, for the next four years she worked with data, and she was eventually drawn to data science as a career. “I knew I wanted to do some work with data,” she told me, “and I wanted to do something creative, and I think that data science has the two mixed.”

A natural autodidact, Agustina found MOOCs to be central to developing her technical skills, finding the courses of Andrew Ng and Peter Norvig particularly influential. “I think that I’m mostly self-taught,” she told me. “I’m curious, so whenever I start learning something, I feel that I have a gap and I need to learn something more. And I jump into that, and then I find a gap there and I need to learn something more. I never stop learning.”


Agustina met her husband, Jonathan, in Argentina. After completing his degree in physics, he wanted to take a few months to travel, so he left for South America for three months to do volunteer work in exchange for food and housing. Jonathan began his trip in Ecuador, but he loves tango and couldn’t think of a better place to end his trip than the dance’s birthplace. Two weeks before the end of his trip, he arrived in Buenos Aires. Agustina, who shares his passion, met Jonathan dancing tango.

Jonathan is Belgian, and he works as a data scientist for Carrefour. Agustina learned about the di-Academy’s boot camp when he mentioned to her that the company was looking for future employees to send to the boot camp. The more she learned about it, the more interested she became, as he explained to her that the goal of the program was to combine academic knowledge with real business cases. Her work on MOOCs had given her lots of theoretical knowledge about data science, but what she found lacking in the material she had learned were the practical aspects. “You don’t have the real experience,” she said. “You’re working with prepared data on real known cases, and I wanted to know what the real story is when you work there.” The community aspect of the boot camp was also central to her interest. “Whenever I worked on MOOCs, I was the only one passionate about data. I didn’t have people that shared the same interests as I did. So I said okay, a group of freaks like me all joined together by the same passion,” she told me. “I’m in.”


Two freaks in their natural habitat

During the first week of the boot camp, the founder of the European Data Innovation Hub and the Brussels Data Science Community, Philippe Van Impe, approached her and asked what type of job she was looking for. After showing her the group’s website, he asked if she would be interested in working with them. She agreed, and two days later had an interview with Karen Boers, the director of the group. “She’s one passionate woman,” Agustina said. “She spoke for forty minutes nonstop about what they do, and I was extremely convinced I wanted to be part of that.” Her internship mostly consisted of helping manage their data, cleaning it and restructuring it, and performing analysis, all to get a better picture of what’s going on with startups in Belgium. “I like being part of the start-up environment,” she said. “As a data scientist, normally you think that one way is to work for a big company that has resources, but we discovered with a little company, with a start-up, you have the same resources and the motivation to really have a say, because you’re not part of the ten-data-scientist team. There’s one person there just to try to make the most of it.”

Since the boot camp graduation on December 20th, Agustina has finished her three-month internship, and those of us here at the di-Academy were thrilled to find out that she is in the process of signing a contract to join their team full-time.

BreakDengue Hackathon Prize 1 Winners: The Cube

by Adrien Dewez (Brand Analyst) & Thomas De Trogh (Media Analysis Expert)

Dengue, an exotic virus?

On the 25th and 26th of November, the Cube participated in the Dengue Hackathon organized by the diHub. The Cube’s goal in this event was to analyze digital messages about the Dengue Virus on a worldwide level in order to make some interesting insights emerge. In addition to Europe, we chose two countries, Brazil and India, to draw a panorama on how people communicate about the Dengue.

On an annual basis, the first results show surprising patterns: even though India’s population is six times larger than that of Brazil, the country only generated 309,000 messages, compared to Brazil’s 1.1 million [1]. Broken down by media type, Twitter comes first, with 72% of all messages for Brazil and 88% for India.


We notice a reactive pattern for India: one big communication peak during two weeks, which correlates with the explosion of the virus. There’s no other peak before or after the outbreak, nor any “momentum” of communication. In contrast, Brazil seems to have a societal approach: the communication is characterized by different peaks over a period of at least four months.
If we look closely at the results for a short period of 4 days before and after the most “communicative” days in 2016, India shows an even higher percentage of Twitter results in the media mix (93% of all messages), while Brazil shows exactly the same results as over the whole year.

Emotion vs. Analysis
Our tool allows us to detect the words most frequently associated with Dengue. In both countries, although the other associated terms are very different, the Dengue virus is associated with another “mosquito virus”: Zika in Brazil and Chikungunya in India, both transmitted by the same mosquito as Dengue, Aedes aegypti. In Brazil, we see words from other South American countries such as Venezuela and Salvador, and institutions both national and international (Ministerio da saude, mundial, etc.). In India, we don’t find any institution, but rather political figures such as Narendra Modi, Arvind Kejriwal and Satyendar Jain. The results were filtered with the names of institutions (in different languages), and we saw that they appeared in 5% of the Brazilian results, while in the Indian results, they only match 1% of the time. We could carefully advance that India has a more reactive response to the Dengue, due to the outbreak of the disease, while Brazil has a more societal approach to the problem.
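As a rough illustration of what “associated words” means here, the sketch below counts the words that appear in the same message as a keyword. It is not the Cube’s tool: the stop-word list, the tokenizer, and the sample messages are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be language-specific and much longer.
STOP_WORDS = {"the", "a", "in", "of", "and", "is", "to", "for", "on", "about"}

def associated_terms(messages, keyword="dengue", top_n=5):
    """Count words that appear in the same message as `keyword`."""
    counts = Counter()
    for text in messages:
        words = re.findall(r"[a-záéíóúãõç]+", text.lower())
        if keyword not in words:
            continue
        # Count each distinct word once per message, skipping the keyword itself.
        counts.update(
            w for w in set(words) if w != keyword and w not in STOP_WORDS
        )
    return counts.most_common(top_n)

# Made-up sample messages, for illustration only.
sample = [
    "Dengue and Zika cases reported in the city",
    "Zika and dengue share the same mosquito",
    "Football tonight",
]
top = associated_terms(sample)
print(top[0])
```

On real data, the same count restricted to a list of institution names would give the kind of share (5% versus 1%) discussed above.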

Minecraft, Really?
Pushing the analyses further brings (even more) astonishing results. Over a period of eleven months, more than 11,000 YouTube videos about Dengue were produced in Brazil. The most popular is a 12-minute video based on the video game Minecraft, in which we can see patients and a doctor as Minecraft characters, with one character talking about the disease. There were also many tweets using Dengue as an expression, for example, “my WhatsApp is as still as the water for the Dengue’s mosquitos.” Twitter isn’t only used to talk about the Dengue disease in a literal sense, but also in a metaphorical way.

Crossing data
The website shows interesting maps illustrating the regions where outbreaks have happened. These outbreak data provided us with our first valuable insight: in which provinces/districts could the stream of social media messages be significantly larger than somewhere else? We reasoned that people in an outbreak region are more inclined to post messages on Twitter or Facebook to say that they’re sick. Since India and Brazil generated quite large datasets, we would be able to make conclusions based on a significantly large amount of data.
When an outbreak of any disease is starting to proliferate, doctors talk about it, the media talks about it, and people talk about it. When visiting friends, going to work, or having conversations with family members, possible outbreaks of fever are always “the talk of the town.” In autumn, during the change of seasons when fever outbreaks are more common, even commercials adapt to this situation. Ads for cough syrup appear more often than usual.

Mood clouds
We focused on words in multiple languages in order to gather as many results as possible, mainly adjectives that express an emotion, a state of mind, or a mood. Valuable messages for us could be Twitter posts with sentences like “I’m ill,” “I feel sick,” “Many patients at the doctor’s,” “Many of my friends have fever,” etc. Monitoring words like “sick,” “fever,” “ill,” “illness,” “bad,” “pain,” “headache,” “feeble,” “afraid,” etc., in combination with each other over time, and looking for peaks in the number of messages along with more engagement by people who retweet, start discussions, etc., could indicate a possible outbreak of fever in a certain region.
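The monitoring idea can be sketched in a few lines. This is only an illustration under simple assumptions, not the tool we used: messages arrive as (week, text) pairs, the word list is a subset of the one above, and a “peak” is any week whose count exceeds the mean by 1.5 standard deviations (an arbitrary cutoff chosen for the example).

```python
from collections import Counter
from statistics import mean, stdev

# Subset of the mood words listed above.
MOOD_WORDS = {"sick", "fever", "ill", "illness", "pain", "headache", "feeble", "afraid"}

def weekly_mood_counts(messages):
    """messages: iterable of (week_number, text). Returns week -> number of matching messages."""
    counts = Counter()
    for week, text in messages:
        words = set(text.lower().replace(",", " ").replace(".", " ").split())
        if words & MOOD_WORDS:
            counts[week] += 1
    return counts

def flag_peaks(counts, threshold=1.5):
    """Flag weeks whose count exceeds mean + threshold * stdev (hypothetical rule)."""
    values = list(counts.values())
    if len(values) < 2:
        return []
    cutoff = mean(values) + threshold * stdev(values)
    return sorted(week for week, n in counts.items() if n > cutoff)

# Made-up data: one matching message in weeks 1 to 5, ten in week 6.
sample = (
    [(w, "I have a fever") for w in range(1, 6)]
    + [(6, "Many friends feel sick")] * 10
    + [(3, "Nice weather today")]
)
counts = weekly_mood_counts(sample)
peaks = flag_peaks(counts)
print(peaks)
```

Weeks with no matching messages never enter the counter in this sketch, so a real version would need to fill those in with zeros before computing the cutoff.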

First, we searched for regions where outbreaks were detected during the last several months. The following picture is rendered by the open source website and shows a cluster of outbreak cases in the Karnataka region in India in June 2016. The red and purple dots indicate multiple verified cases of dengue fever.


Figure 1 Outbreak of dengue in the Karnataka region in June 2016 (source

Next, we zoomed in and started to gather social and online messages from this region, focussing on words associated with emotions. The next three word clouds show the top 25 results from January to March.


These mood word clouds already indicate a shift in discourse about dengue. People in the Karnataka region, whether journalists, bloggers, or citizens who posted on Twitter or Facebook, started to talk about dengue as a more or less far away problem. Words such as “Panadol,” “Sanofi Pasteur,” “clean,” “helpful” or the main word “approved” indicate preparatory policies and a search for cures to the disease. An outbreak is still far away.


In February and, more obviously, in March, the discourse changes. More words related to symptoms begin to appear: “cold,” “feeble,” “overworked,” or “positive.” At the same time, words like “hope,” “prepared,” “clean,” “fear,” or “determined” indicate an increase in worried reactions to a possible outbreak. Looking at the words people use in association with dengue in the two months before and the month of the outbreak, the shift in discourse is even more striking.


Words such as “patient,” “suffering,” “desperate,” “critical,” “alert,” “stagnant,” or “ICU” (Intensive Care Unit) jump out and are used very frequently. This could indicate that an outbreak has taken place. At the same time, we see terms associated with the Indian government emerge: “Arvind Kejriwal” (the Chief Minister of Delhi) and “Mcd” (Municipal Corporation of Delhi). In the following months, the names of these political leaders emerge even more often, corresponding to the giant peak in September.

What did we learn from these results? By associating mood words with dengue, we were able to detect shifts in discourse about the disease itself. By using a relevant search query, related to mood words and terms that are linked to symptoms (“cold,” “illness,” “fever,” “feeble,” “headache,” etc.), social media monitoring tools could be used to detect possible outbreaks at an early stage.
Crossing medical and health data with social and online media messages could in this way provide valuable insights to governments, international institutions such as the WHO and UNICEF, or local health organizations.

Despite the fact that the time we had for our presentation was limited, we wanted to make a rapid analysis of the situation in Europe. The timeline and the associated terms show that the European news on Dengue is linked to Brazil, with the same timeline and with associated words referring to South American countries and viruses. Further, as a remarkable coincidence, the same number of YouTube videos were posted in Brazil and in Europe: 11,000 videos, though the Minecraft aspect seems more common in Brazil than in Europe. Unsurprisingly, given the linguistic proximity, Spanish is the leading language in Europe.

It seems important that India treats the Dengue virus as a societal issue to spread the message more widely (and act on prevention more than reaction). The best way to do this seems to be to associate a public figure with the communication. In Brazil, both national and international institutions are a good ally. We would like to underline the increasing importance of alternative media such as YouTube, Minecraft, and WhatsApp as tools to prevent or stay informed on a coming Dengue explosion. Finally, Europe has to see the Dengue virus as a present threat and not as an exotic virus. Cases of Dengue were reported in Spain, Portugal, the South of France, Italy and even Germany. Considering the enormous economic impact it has in other countries, the risk of Dengue becoming endemic in Europe can’t be underestimated. These are our first conclusions from a rapid approach. Bringing more precise results and insights needs a deeper dive into the Dengue digital world.

[1] Results from 1st January to 26th of November.

You can view their presentation from the Hackathon here


Hackathon Winners: Stream, Using Twitter to Predict Dengue Outbreaks

Team Stream, winners of the Brussels Region Prize for the Best Team Related to Developing Countries, aimed to tackle the problem of Dengue being generally under-recognized by increasing its visibility to the general public. Led by Ciprian Iamandi, the team consisted of Hannah Pinson, Laszlo Kupcsik, Laurent Exsteens, and the di-Academy’s bootcamper Sabrina Trifi.

In previous research done in Brazil, the paper “Dengue surveillance based on a computational model of spatio-temporal locality on twitter” by Gomide et al. found that the R squared coefficient between personal tweets related to dengue and confirmed cases was 0.9578, an extremely strong correlation. Team Stream thought this would be a promising place to start from.
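For readers unfamiliar with the statistic: in a simple linear fit, R squared equals the squared Pearson correlation between the two series. The sketch below computes it by hand. The weekly numbers are invented for the example and have nothing to do with the paper’s data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov ** 2 / (var_x * var_y)

# Made-up weekly counts, for illustration only.
personal_tweets = [10, 20, 30, 40, 50]
confirmed_cases = [5, 11, 14, 21, 24]

score = r_squared(personal_tweets, confirmed_cases)
print(round(score, 4))
```

A value close to 1 means the confirmed cases rise and fall almost in lockstep with the tweet counts; it says nothing about which one causes the other.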

Their process was as follows: gather social media data in real time, filter this data on personal experiences, identify clusters from this data in real time, and finally, visualize the data.


The first step of their process was choosing a region to study. They chose Latin America for two reasons: the relative uniformity of the languages spoken (Portuguese and Spanish), and internet penetration above the global average (66.7% compared to 50.1%). Their next step was to gather tweets from this region containing the word ‘dengue’, and then filter these tweets with the machine learning classification algorithm Random Forest to determine whether they expressed a personal experience of a dengue case. The result was workable data: a set of tweets that contain the word ‘dengue’, are geo-localized to the region they wished to study, and are related to personal experience.
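The three conditions on the final dataset can be sketched as a simple filter. This is not the team’s code: the bounding box is a rough, hypothetical stand-in for “the region”, and `looks_personal` is a naive keyword rule standing in for their trained Random Forest classifier, which is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Tweet:
    text: str
    lat: float
    lon: float

# Rough, hypothetical bounding box for the study region (an assumption).
REGION = {"lat_min": -56.0, "lat_max": 33.0, "lon_min": -118.0, "lon_max": -34.0}

def in_region(tweet, box=REGION):
    """True if the tweet's coordinates fall inside the bounding box."""
    return (box["lat_min"] <= tweet.lat <= box["lat_max"]
            and box["lon_min"] <= tweet.lon <= box["lon_max"])

def looks_personal(text):
    """Stand-in for the trained classifier: a naive first-person phrase rule."""
    markers = ("tengo dengue", "estou com dengue", "i have dengue")
    lowered = text.lower()
    return any(marker in lowered for marker in markers)

def filter_tweets(tweets, is_personal=looks_personal):
    """Keep tweets that mention dengue, are in the region, and read as personal."""
    return [
        t for t in tweets
        if "dengue" in t.text.lower() and in_region(t) and is_personal(t.text)
    ]

# Made-up tweets, for illustration only.
sample = [
    Tweet("Estou com dengue, que horror", -23.5, -46.6),      # personal, in region
    Tweet("Dengue cases rise, says ministry", -15.8, -47.9),  # news, in region
    Tweet("Tengo dengue", 40.4, -3.7),                        # personal, outside region
]
kept = filter_tweets(sample)
print(len(kept))
```

Passing the classifier in as `is_personal` keeps the filter independent of how the personal/non-personal decision is made, so a trained model could replace the keyword rule without touching the rest.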

From this collection of tweets, the team used the stream clustering algorithm D-Stream to create location-based clusters that predict the prevalence of Dengue. The next step was to compare these clusters to


Cluster Correlation

In the image above, the intensity of the color of the region represents the prevalence of dengue outbreaks in that region, and the circles represent clusters identified for a particular region, with the intensity of that color representing the predicted prevalence of dengue in that region.
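To give a feel for how grid-based stream clustering can produce such circles, here is a much-simplified sketch. It is loosely inspired by the general idea of density grids with time decay, and is not the D-Stream algorithm itself nor the team’s implementation: the cell size, decay factor, and density threshold are arbitrary assumptions.

```python
from collections import defaultdict

class GridDensity:
    """Toy density grid: points add weight to cells, and weights fade over time."""

    def __init__(self, cell_size=1.0, decay=0.9, dense_threshold=2.0):
        self.cell_size = cell_size
        self.decay = decay
        self.dense_threshold = dense_threshold
        self.density = defaultdict(float)

    def _cell(self, lat, lon):
        return (int(lat // self.cell_size), int(lon // self.cell_size))

    def add(self, lat, lon):
        """Register one geo-located tweet."""
        self.density[self._cell(lat, lon)] += 1.0

    def tick(self):
        """Advance one time step: every cell's density fades."""
        for cell in self.density:
            self.density[cell] *= self.decay

    def clusters(self):
        """Group dense cells that touch each other (4-neighbourhood)."""
        dense = {c for c, d in self.density.items() if d >= self.dense_threshold}
        seen, groups = set(), []
        for start in dense:
            if start in seen:
                continue
            seen.add(start)
            stack, group = [start], set()
            while stack:
                row, col = stack.pop()
                group.add((row, col))
                for nb in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
            groups.append(group)
        return groups

# Made-up coordinates, for illustration only.
grid = GridDensity()
for _ in range(3):
    grid.add(-23.5, -46.6)
for _ in range(2):
    grid.add(-23.5, -45.6)
grid.add(10.2, -70.3)

before = grid.clusters()
grid.tick()
after = grid.clusters()
print(len(before[0]), len(after[0]))
```

Because densities fade at every tick, a cell only stays in a cluster while new tweets keep arriving there, which is what makes the approach suited to streaming data.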

The correlation between the tweets and the confirmed Dengue cases is -> insert number here. Moving forward, Ciprian and his team hope to update their algorithm to use streaming rather than downloaded data. Further, they plan to add new sources of data, such as Google searches and Facebook statuses, and to develop their work into a web-based application, creating an actionable, sustainable, and scalable tool to educate the global public about dengue and its outbreaks.

You can view the video of their presentation at the hackathon below.

Overall Winner of the DengueHack Hackathon: DatAsset

Dengue, the most important arbovirus (arthropod-borne virus, a group that also includes yellow fever and Zika), is spreading rapidly, and half of the world’s population is at risk of contracting it.

In the last half of a century, the number of people affected by dengue has increased 30-fold (say whaat!?), and it threatens to invade Europe and the United States.

We were happy to stumble upon the project, so we gathered together an AI kid, a monsieur docteur, and a happy bunch of data wizards to stop this ominous prospect by hacking the heck out of Dengue. In data we trust! Together, as data scientists, we embarked upon this epic 36-hour journey.

Ideas for approaching this project were manifold, and we wanted to really use our team’s diverse set of skills. Eventually, we settled on three goals to work toward: building a predictive incidence model for Sri Lanka, one of the countries in South Asia more heavily impacted by Dengue; building a cool visual representation of past and predicted data; and exploring the possibilities of our approach from a public health perspective.

For our model, we experimented with including important national socio-economic and environmental data, with the goal of ultimately applying it to different countries as a next step.

Why building a predictive model is so important: the later an outbreak is predicted, the less effective intervention measures are. If an epidemic can be detected earlier, or even predicted before it occurs, the number of cases avoided increases (as can be seen in the image below).

The value of outbreak prediction

Who needs sleep? Even during the few hours of rest we had, our models were running in our brains. With concerted effort, at the end of the thirty-six hours, we’ve arrived at a nice final presentation for you to enjoy. Feedback is most welcome!

Thanks so much to the organizers for this wonderful event. You’ve made us realize that there’s so much common ground and cause between the worlds of data science and public health. Data for good has a bright future!


The DatAsset team: Klaas, Michael, Pieter, Andreas, Joren

Hackathon Winner for Best Storytelling: XploData

XploData are the Hackathon prize winners for Best Storytelling. You can view their presentation from the event here.

Our[1] goal at the hackathon was to determine the main factors or variables (climate, population, livestock, vegetation…) that influence both Aedes mosquitoes and the Dengue virus. We approached this by building two models: one that would take climate and population data to predict mosquito presence in parts of the world where this data is lacking, and a second that would integrate this new data with climate and population data to predict Dengue outbreaks.

In the process of the hackathon, we faced several problems in creating these models, so we turned our focus to a different second model. Data on Dengue worldwide is inconsistent: we have reports of countries with Dengue, countries where we can confidently say Dengue is not present, and countries where Dengue may or may not be present. By looking at environmental variables that explain the presence or absence of Dengue, we were able to create a model that could estimate the chance of Dengue being present in the countries where Dengue remains unconfirmed.

The problem we faced in building our first model was that the available data for the mosquitoes was only ‘presence’ data, lacking real absence data. We therefore added an artificial temperature threshold in this model, to create so-called ‘pseudo-absence’ data for mosquitoes. Of course, this resulted in a model that lacked the sophistication we had hoped for. The future is promising, though. For instance, satellites and satellite imaging are constantly improving. With improved imaging technology, we could gather more precise data on vegetation, livestock movement and population movements. Bringing this data together in one table, with as many points on earth as possible and with climate data, population data, livestock data, vegetation data, the amount of standing water, mosquito data, and Dengue data, would greatly improve our predictions on Dengue. After all, the impact of being able to predict the presence of Dengue cannot be overstated.
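The pseudo-absence idea can be sketched as follows. This is an illustration, not our hackathon code: the 10 °C cutoff, the field names, and the sample cells are all assumptions made for the example.

```python
def build_training_rows(cells, presence_ids, min_temp_c=10.0):
    """Label grid cells for a presence/absence model.

    cells: list of dicts with 'id' and 'mean_temp_c'.
    presence_ids: ids of cells with a recorded mosquito observation.
    min_temp_c: hypothetical threshold below which mosquitoes are assumed absent.
    Returns (id, label) pairs: 1 = presence, 0 = pseudo-absence.
    Cells that are warm enough but have no record stay unlabeled and are skipped.
    """
    rows = []
    for cell in cells:
        if cell["id"] in presence_ids:
            rows.append((cell["id"], 1))
        elif cell["mean_temp_c"] < min_temp_c:
            rows.append((cell["id"], 0))
    return rows

# Made-up grid cells, for illustration only.
cells = [
    {"id": "A", "mean_temp_c": 27.0},  # recorded presence
    {"id": "B", "mean_temp_c": 4.0},   # too cold: pseudo-absence
    {"id": "C", "mean_temp_c": 22.0},  # warm, no record: unknown
]
rows = build_training_rows(cells, presence_ids={"A"})
print(rows)
```

The weakness is visible in the sketch itself: every absence label comes from the temperature rule, so a model trained on these rows can largely relearn that rule rather than discover anything new about mosquito habitat.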

To conclude, we want to thank the organizers for giving us the opportunity to join the Hackathon, discover new technologies and meet interesting people. Also special thanks to the members of TeraData who greatly helped us during our preparations and final building of the models.

[1] Our team consisted of members of XploData, i4BI, Janssen Pharma and the University of Liege: scientists, engineers, physicists, and informatics specialists. We worked in a multidisciplinary way, combining various skills such as data engineering, data science, data modelling and data visualization. Work hard, play hard (Figure 1).


Figure 1: Hacking is fun!

Launching The Dengue Hackathon

On October 11th, the diHub hosted the launch event for the hackathon, which takes place on November 25th and 26th. Each Tuesday leading up to the hackathon, you’re welcome to join our meetings at the diHub to discuss the data we’ve gathered and prepare for the hackathon. You can learn more about our upcoming events on our meetup page.

We were lucky enough to have the following speakers present: Serge Masyn from Janssen (Pharmaceutical company of Johnson and Johnson), Dr. Guillermo Herrera-Taracena from Johnson & Johnson, Anne-Mieke Vandamme, a professor at KU Leuven, Daniel Balog, Stefan Pauwels, and Tom Crauwels from Luciad, Jeroen Dries from Vito, Guy Hendrickx from Avia-GIS, and Pierre Marchand from Teradata.

Annelies Baptist, bootcamp participant and project manager for the hackathon, opened the presentation by explaining the importance of our hackathon and fighting the spread of dengue, and ended by introducing the rest of our speakers.


Serge Masyn, director at Janssen Global Public Health, presented Janssen’s three goals for the hackathon: to raise awareness about global public health, to raise awareness of dengue, and to try to create new insights into the spread of dengue and predictions of future outbreaks. A year ago, this initiative was only an idea, and Serge was pleased to see how much progress we’ve made toward making it a reality (here is a video from the March 2016 di-Summit, where Serge announced Janssen’s desire to sponsor what would become this very dengue hackathon).


Serge then introduced Dr. Guillermo Herrera-Taracena, the global clinical leader on infectious diseases and global public health for Johnson & Johnson. Guillermo is an engaging and enthusiastic speaker, and he made a point of emphasizing the importance of this work to global health at large. After the Ebola outbreak, Zika took its place in the public perception as the leading global health concern. Though Dengue is a serious public health burden in its own right, Zika, Guillermo claimed, is a cousin, if not a brother, of the Dengue virus, and both diseases are carried by the same species of mosquito. Whatever you do to understand Zika, you’ve done for Dengue, and vice versa. If that isn’t a good enough reason to work on Dengue, he said, he wasn’t sure what is.


Anne-Mieke Vandamme, a professor at KU Leuven and head of the Laboratory of Clinical and Epidemiological Virology, called in from Lisbon to give a talk about mapping epidemics. Using phylogenetic trees, scientists can reconstruct the origin and development of a virus outbreak. After her presentation, she introduced Daniel Balog, a senior software engineer at Luciad with whom she had previously collaborated. Daniel gave a demo using Luciad software showing an animation of the Ebola outbreak in Sierra Leone, Liberia, and Guinea.


Then, Stefan Pauwels and Tom Crauwels gave a demo of the software products from Luciad. Though most of their software is geared toward military and aviation use, the technology that makes visualizing position updates every second for millions of points possible has applications beyond the scopes of those industries. For the hackathon, Luciad will be offering the free use of their software, and will also provide a training workshop in preparation for the event.

Tom Crauwels

Stefan Pauwels

Jeroen Dries from Vito then discussed how satellite image data can be used for the hackathon to fight dengue. Vito operates a Belgian satellite that takes daily images, and combines these images to create a global time series analysis of how an area has been evolving. They’ve built an application focused on these time series that includes meteorological data from each country, which is of particular importance for the hackathon. For this event, Vito will provide us with a cloud platform that has access to a Hadoop cluster for processing their satellite data.


Guy Hendrickx from Avia-GIS presented their research on Dengue, where they mapped the Tiger mosquito. In the 90’s, Guy was one of the first people to use satellite data to model tsetse fly distribution and the diseases they transmit. In 2010 for the European Center for Disease Control, Avia-GIS began developing a database for the network of mosquitos, ticks, and sandflies all over Europe and producing maps of these different species every three months. Avia-GIS are also generously providing the free use of these databases for the hackathon.


Finally, Pierre Marchand presented from Teradata. Put in the unfortunate position of being the last barrier between a room full of hungry people and their pizza, he made his presentation quick. Teradata will be providing the free use of their Aster platform for storing and modeling the data, and will be providing training on using this platform in the coming weeks leading up to the hackathon.


And, at the end, there was pizza, beer, and networking.


Again, we’d like to extend an enormous thank you to the speakers at the event and for the previous and ongoing support provided by the organizations involved. You can view pictures of the event on our Facebook page and videos of the presentations on our YouTube channel.

Data Science Bootcamp: Week 2

My name’s Alexander Chituc, and I’ll be your foreign correspondent in Brussels, regularly reporting on the diHub and the data science community here in Belgium. I’m an American, I studied philosophy at Yale, and I’m one of the seventeen boot-campers for the di-Academy.

We started the second week of the Data Science bootcamp developing some more practical skills. The first day was devoted to learning about building predictive models using R with Nele Verbiest, a Senior Analyst from Python Predictions. The second day, we worked with Xander Steenbrugge, a data analyst from Datatonic, learning about Data Visualization using Tableau Software.

Day 1: Predictive modeling

Nele told us to think of predictive modeling as the use of all available information to predict future events to optimize decision making. Just making predictions isn’t enough, she said, if there’s no action to take.

The analogy used throughout the training was that developing a predictive model was like cooking. We can think of cooking for a restaurant as having five general steps: take the order, prepare the ingredients, determine the proportion of ingredients to use and how to cook them, taste and approve the dish, and finally, serve the dish and check in with the customer. We can translate this into five analogous steps for preparing a predictive model: project definition, data preparation, model building, model validation, and model usage.


We were given a lab in predictive modeling in R, providing us with hands-on experience with the methodology and techniques of predictive modeling. A sample dataset was provided, and the lab walked us step by step through the process of developing a model to detect the predictors that determine the likelihood of whether a customer will churn (for those outside the biz, a churn rate is the rate at which individuals leave a community over time, in this case that means canceling a subscription with a telecom provider). This lab took us through all five steps of the process, and along the way we cleaned data, replaced any outliers, went over the basics of model building, discussed the danger of over-fitting a model (the analogy here was recording a concert — you want to record the music, not the sound of the audience, conductor’s baton, or pages turning) and how to simplify a model to prevent this. We went over decision trees, linear regression, logistic regression, variable selection, and how to evaluate your model.
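To make the model-building step a little more concrete, here is a minimal sketch of logistic regression with a single predictor, fitted by plain gradient descent. It is written in Python rather than the R used in the lab, it is not the lab’s code, and the tiny dataset (support calls versus churn) is invented for the example.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(churn) = sigmoid(w * x + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up data: support calls last month, and whether the customer churned (1) or not (0).
support_calls = [0, 1, 1, 2, 3, 4, 5, 6]
churned = [0, 0, 0, 0, 1, 0, 1, 1]

w, b = fit_logistic(support_calls, churned)
p_low = sigmoid(w * 0 + b)   # predicted churn probability with 0 calls
p_high = sigmoid(w * 6 + b)  # predicted churn probability with 6 calls
print(p_low < p_high)
```

With one predictor and eight rows there is little room to over-fit; the danger discussed above grows as more variables are added, which is why the lab paired model building with variable selection and validation on data the model had not seen.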


There’s obviously a lot more detail I could get into here, but if I had to write about all of it, I’d never get the chance to write about day two.

Day 2: Data Visualization using Tableau Software

The second day, we immediately jumped into how to use Tableau software. Considering just how much it’s possible to do with this program, I was surprised by how intuitive and easy to use it was. Managing data is extremely simple, and to create a graph you simply set the parameters, select the graph type, assign data to the columns and rows, set any filters you might want, and choose which data you want to visually represent by color, size, or label.

Xander walked us through how to create the dashboard below, demonstrating the sales of a sample superstore geographically, showing which quarters and departments had the most sales, as well as the average shipping delay for each category and subcategory.

After lunch, we were given a dataset and an image of a dashboard, and asked to recreate it ourselves in Tableau. After learning the basics with Xander, it was nice to be tossed into the pool to get some real practice swimming:


If you’re interested in seeing more of what Tableau software is capable of, here’s an example of an interactive graph from their website, where you can explore Global Nuclear Energy Use. You can explore the entire gallery here.

Thanks again to Nele Verbiest and Xander Steenbrugge for being such great teachers, and expect a post on week 3 soon.

Bayes in Action

During my coursework in Philosophy, we devoted a lot of time to discussing Bayes’ theorem. Two fields find it particularly important, the Philosophy of Science and Epistemology, or the study of what knowledge is. It’s considered a pillar for rational thinking and increasing our understanding of the world, and it’s fundamental for evaluating claims given the evidence we have. Bayes’ theorem looks like this:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Bayes’ Theorem

To put it simply, Bayes’ theorem describes the probability of a hypothesis or event based on relevant conditions or evidence. This equation might look complex, but it’s actually quite easy to understand after a little bit of translation. ‘P’ stands for ‘the probability that’, ‘|’ is a symbol that means something like ‘given that’, ‘A’ stands for a hypothesis, and ‘B’ stands for an event or evidence that might impact the likelihood of the hypothesis. When we understand it this way, the equation reads: the probability of a hypothesis given some evidence is equal to the probability of that evidence given the hypothesis, multiplied by the probability of the hypothesis, and all of this is divided by the probability of that evidence.

An example can clear things up. Let’s say you check WebMD because you have a nasty cough. You see that having a nasty cough is a symptom of cancer, and that the likelihood of having this cough if you have cancer is very, very high. If you had cancer, this nasty cough is exactly what you would expect to see, so it must be pretty probable that you have cancer, and like most people who visit WebMD, you walk away convinced that you’re dying. Bayes’ theorem helps us see why thinking this way is a mistake.

Let’s fill in the equation with some numbers we made up. Let’s assume the probability that you have the cough given that you have cancer is very high: 95%. But you’re a young and healthy person, and at your age, only one in a hundred thousand people get this kind of cancer. And again, let’s assume having a nasty cough is pretty common. It’s cold season after all, so one in a hundred people have a nasty cough. Filling it in, we get this:

$$P(\text{cancer} \mid \text{cough}) = \frac{P(\text{cough} \mid \text{cancer})\,P(\text{cancer})}{P(\text{cough})} = \frac{0.95 \times 0.00001}{0.01}$$


So, if we do the math, we come up with your probability of having cancer given that you have a nasty cough: 0.00095, a pretty small chance.
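If you’d rather check the arithmetic in code, here is a minimal Python sketch of the same calculation. The function name `bayes_posterior` and the variable names are my own choices for illustration, and the numbers are the made-up ones from the example above, not real medical statistics.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if not 0 < evidence <= 1:
        raise ValueError("P(B) must be greater than 0 and at most 1")
    return likelihood * prior / evidence


# Made-up numbers from the example (illustrative assumptions only)
p_cough_given_cancer = 0.95   # P(B|A): chance of the cough if you have cancer
p_cancer = 1 / 100_000        # P(A): one in a hundred thousand
p_cough = 1 / 100             # P(B): one in a hundred

p_cancer_given_cough = bayes_posterior(p_cough_given_cancer, p_cancer, p_cough)
print(f"P(cancer | cough) = {p_cancer_given_cough:.5f}")
```

Try changing the prior, `p_cancer`, to see how much the answer depends on how rare the disease is to begin with.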

The application of Bayes’ theorem in the field of medicine is extremely useful, especially when considering the accuracy of tests and the likelihood of false positives or false negatives, and there are countless other practical applications for it.

Bayes’ theorem is quite simple, but its application to the field of statistics, or Bayesian Statistics, is quite complex, and it’s an important part of how Google can filter search results for you, how your email can detect spam, and how Nate Silver could accurately predict the 2008 presidential election in the United States.

A PhD in Astronomy, Romke Bontekoe typically offers his course on Bayes in Action in Amsterdam, but on October 20th, he’ll be offering his training here at the European Data Innovation Hub. The training is geared towards managers and researchers who want to understand Bayesian Statistics and its application, but the course is open to anybody interested.

If you’d like to learn more about Bayes’ theorem, you can look at this video that I animated for Wireless Philosophy, and if you want to learn more about Bayesian Statistics and its application, register for the training on the di-Academy’s website.


Announcing the Launch Event for the Dengue Hackathon

I’m excited to announce the launch event for the diHack’s Dengue Hackathon, at 6 p.m. on Tuesday, October 11th at the European Data Innovation Hub. We’ll present the dengue challenge, give examples of how data science can help stop the spread of dengue, provide information about coming events, and leave time for networking. You can view the event on Meetup here.


There are over 390 million cases of dengue fever every year, and half of the world is currently at risk of contracting the dengue virus. We believe that if we get enough data and data scientists together, we can make a difference in stopping the disease’s spread.


You can check out our website here, and everyone is invited to the launch event. Don’t forget to share your data or ideas and sign up for the hackathon.