Hackathon Winners: Stream, Using Twitter to Predict Dengue Outbreaks

Team Stream, winners of the Brussels Region Prize for the Best Team Related to Developing Countries, set out to address the under-recognition of dengue by increasing its visibility to the general public. Led by Ciprian Iamandi, the team consisted of Hannah Pinson, Laszlo Kupcsik, Laurent Exsteens, and the di-Academy’s bootcamper Sabrina Trifi.

In previous research done in Brazil, the paper “Dengue surveillance based on a computational model of spatio-temporal locality on Twitter” by Gomide et al. found an R-squared coefficient of 0.9578 between personal tweets related to dengue and confirmed cases, an extremely strong correlation. Team Stream saw this as a promising starting point.
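To make the R-squared figure concrete, here is a minimal sketch of how such a coefficient can be computed for two weekly series. It assumes a simple linear fit with one predictor, in which case R-squared equals the square of Pearson's r; whether the paper computed its value exactly this way is not stated here. The function name and the weekly counts are made up for illustration and are not data from the paper.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs.

    With a single predictor this equals the square of Pearson's r.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical weekly counts, for illustration only.
weekly_dengue_tweets = [12, 18, 25, 40, 38, 55]
weekly_confirmed_cases = [30, 41, 60, 95, 90, 130]
print(round(r_squared(weekly_dengue_tweets, weekly_confirmed_cases), 3))
```

A value close to 1 means that a straight line through the tweet counts explains almost all of the variation in confirmed cases.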

Their process was as follows: gather social media data in real time, filter this data for personal experiences, identify clusters in the filtered data in real time, and finally visualize the results.

[Image: denguehack-end2end]

The first step of their process was choosing a region to study. They chose Latin America for two reasons: the relative uniformity of the languages spoken (Portuguese and Spanish), and an internet penetration above the global average (66.7% compared to 50.1%). Their next step was to gather tweets from this region containing the word ‘dengue’, and then to filter them with the machine learning classification algorithm Random Forest to determine whether each one expressed a personal experience of a dengue case. The result was a workable data set: tweets that contain the word ‘dengue’, are geo-localized to the region under study, and relate to personal experience.
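The three filters described above (keyword, location, personal experience) can be sketched as a small pipeline. This is an illustration under stated assumptions, not the team's code: the `Tweet` type, the bounding box, and all function names are hypothetical, and `toy_is_personal` is a keyword placeholder standing in for the trained Random Forest classifier, which would be plugged in through the `is_personal` parameter.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Tweet:
    text: str
    coords: Optional[Tuple[float, float]]  # (latitude, longitude); None if not geotagged


# Rough, hypothetical bounding box: (min_lat, max_lat, min_lon, max_lon).
LATIN_AMERICA_BOX = (-56.0, 33.0, -118.0, -34.0)


def in_box(coords: Tuple[float, float], box: Tuple[float, float, float, float]) -> bool:
    lat, lon = coords
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def workable_tweets(
    tweets: Iterable[Tweet],
    is_personal: Callable[[str], bool],
    box: Tuple[float, float, float, float] = LATIN_AMERICA_BOX,
) -> List[Tweet]:
    """Keep tweets that mention dengue, are geotagged inside the box, and are personal."""
    kept = []
    for tweet in tweets:
        if "dengue" not in tweet.text.lower():
            continue  # keyword filter
        if tweet.coords is None or not in_box(tweet.coords, box):
            continue  # location filter
        if not is_personal(tweet.text):
            continue  # personal-experience filter (classifier goes here)
        kept.append(tweet)
    return kept


def toy_is_personal(text: str) -> bool:
    """Placeholder rule, NOT a Random Forest: looks for first-person phrases."""
    markers = ("estou com", "tengo", "i have")
    lowered = text.lower()
    return any(marker in lowered for marker in markers)


sample = [
    Tweet("Estou com dengue, que semana horrível", (-23.55, -46.63)),
    Tweet("Nuevo informe sobre el dengue en la región", (4.71, -74.07)),
    Tweet("Tengo dengue", None),
    Tweet("I have dengue", (50.85, 4.35)),
    Tweet("Tengo gripe", (19.43, -99.13)),
]
kept = workable_tweets(sample, toy_is_personal)
print([t.text for t in kept])
```

Passing the classifier in as a function keeps the keyword and location filters independent of whichever model decides what counts as a personal experience.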

From this collection of tweets, the team used the stream clustering algorithm D-Stream to create location-based clusters that predict the prevalence of dengue. The next step was to compare these clusters to the confirmed prevalence of dengue in each region.
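To give a feel for density-based stream clustering, here is a simplified sketch in the spirit of D-Stream: each geotagged tweet adds 1 to the density of its grid cell, densities decay over time, and adjacent dense cells are merged into clusters. It is not the full algorithm (there are no transitional or sporadic grids, and no periodic cluster adjustment) and not the team's implementation. The class name, cell size, decay factor, and fixed density threshold are all assumptions chosen for illustration.

```python
import math


class DensityGrid:
    """Simplified decaying density grid; timestamps must be non-decreasing."""

    def __init__(self, cell_size=1.0, decay=0.998, dense_threshold=3.0):
        self.cell_size = cell_size              # cell width in degrees
        self.decay = decay                      # per-time-step decay factor, in (0, 1)
        self.dense_threshold = dense_threshold  # minimum density for a "dense" cell
        self.cells = {}                         # (row, col) -> (density, last update time)

    def _cell(self, lat, lon):
        return (math.floor(lat / self.cell_size), math.floor(lon / self.cell_size))

    def add(self, lat, lon, t):
        """Decay the cell's stored density up to time t, then count the new point."""
        key = self._cell(lat, lon)
        density, last = self.cells.get(key, (0.0, t))
        self.cells[key] = (density * self.decay ** (t - last) + 1.0, t)

    def density(self, key, now):
        density, last = self.cells[key]
        return density * self.decay ** (now - last)

    def clusters(self, now):
        """Group edge-adjacent dense cells into clusters (connected components)."""
        dense = {k for k in self.cells if self.density(k, now) >= self.dense_threshold}
        seen, result = set(), []
        for start in sorted(dense):
            if start in seen:
                continue
            seen.add(start)
            stack, group = [start], []
            while stack:
                row, col = stack.pop()
                group.append((row, col))
                for nb in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
            result.append(sorted(group))
        return result


grid = DensityGrid(cell_size=1.0, decay=0.9, dense_threshold=2.0)
for t in (0, 1, 2):
    grid.add(0.5, 0.5, t)   # repeated activity in cell (0, 0)
for _ in range(3):
    grid.add(0.5, 1.5, 2)   # burst in the neighbouring cell (0, 1)
grid.add(5.5, 5.5, 2)       # a single isolated point
print(grid.clusters(now=2))
print(grid.clusters(now=50))
```

Because densities decay, a region with no new tweets eventually drops below the threshold and its cluster disappears, which is what makes this family of methods suitable for streaming data.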

Cluster Correlation

In the image above, the color intensity of each region represents the prevalence of dengue outbreaks there. The circles represent the clusters identified for a region, and the color intensity of each circle represents the predicted prevalence of dengue in that region.

The correlation between the tweets and the confirmed dengue cases is [insert number here].

Moving forward, Ciprian and his team hope to adapt their algorithm to use streaming rather than downloaded data. They also plan to add new sources of data, such as Google searches and Facebook statuses, and to develop their work into a web-based application, creating an actionable, sustainable, and scalable tool to educate the global public about dengue and its outbreaks.

You can view the video of their presentation at the hackathon below.
