Recently Jake Porway, the founder and executive director of DataKind, visited the NYC Data Science meetup to talk about his organization’s work using data for good. DataKind is a nonprofit that pairs data scientists with mission-driven organizations, and Jake gave a few examples of fruitful efforts that mapped indicators of childhood wellbeing in Washington, D.C., or figured out whether widespread tree-pruning makes New Yorkers any safer.
Jake also spoke about some of the lessons learned in working with myriad organizations where data science has not historically been a priority, and closed with some thoughts about the promise and potential pitfalls of data science.
The slides and video from the talk are embedded below, along with a timestamped summary of Jake’s talk. If you’d like to be notified about the next Data Science meetup, join our group!
[1:15] Crisis of communication
[4:10] “It really seems like we have these superpowers”
[6:10] “Every field is having its data moment”
[7:05] Why we founded DataKind
[7:55] Example project: Childhood wellbeing in D.C.
[10:10] Example project: Trees in NYC
[12:15] Example project: Poverty maps
[16:45] Increasing data literacy
[17:30] “Our job is to demystify”
[18:30] Data in the social sector
[19:25] Activating genius
[21:00] Visualization is a process
[23:40] The future
[26:45] Data for good, from unexpected places
[30:35] Ethics and PR
[32:40] Closing thoughts: The Macroscope
[0:09] Thank you so much, everybody. Thank you K and John and Meetup for having me. I don’t know if King of All Media is a title I can live up to, but if I can make it out of this meetup making you guys a little happier, I’ll take that. So as John mentioned, I run DataKind, a non-profit using data for good. I see some faces in the crowd who have actually been to a couple of our events, so some of this may be old hat for you guys, but I’m really glad to see you all here, and I’m just going to start out with a little bit that’s about the story of DataKind.
[0:38] This is going to be a fairly non-technical talk, so I will not be offended if you get up and leave and go to the Machine Learning meetup down the street; however, I would like to sort of just give you an overview. We’re going to talk about some things that are top-of-mind to me and to us right now, and then we’re going to open it to conversation because I talk about this all the time, I already know what I think about it. I would rather hear what you guys think about it — because what I think is so cool is that there’s actually a meetup here for data scientists. Not really a lot of places we can go to coalesce, and so I’m really excited that K has gathered us here to figure out maybe how to think about these big ideas in our field.
[1:13] I’m going to start out with a talk that I usually give to non-technical audiences, and I’m going to begin there because I think we’re actually a profession that is a bit in the middle of an identity crisis. We have a crisis of communication, and so to sort of show why, I wanted to actually have you guys do an exercise. I’d like you guys to just picture data — right, when you hear that word “data,” or “big data” — go ahead, just close your eyes. See what you picture when you hear that. You probably think of your day job. And then just open your eyes and just shout out, what do you picture when you hear that?
[2:00] You guys are nailing it — so, spreadsheets, .CSV files. I heard someone say servers, like horrible servers, or even worse, as John said, the tunnel of binary — like, we all live in the Matrix, right? This is horrible. I just found a great Tumblr that Mike Dewar actually sent to me called Big Data Explained, and it’s all pictures of Big Data in marketing. Look at this; this is tragic. This guy is lost in a sea of binary, it’s so sad! The biggest takeaway I get from this is Big Data is blue, definitely blue. And this is so sad because I actually find myself thinking about data in this way. I picture this, and that’s pathetic because data has become personal and so pervasive that it actually doesn’t take much to realize how much Big Data is changing our lives and to actually think back to — like, go back to the Dark Ages, way before we had technology and Big Data, back to the year 2000, where this is what it would look like to get a movie. You all remember that, right? Going to Blockbuster, which sucked and was horrible and we all just lived in this horrible dark time, but of course now we have Netflix which lets us pick wonderful movies; we’ve got Google Maps which gets us from Point A to Point B, and Amazon which gives us recommendation engines.
[3:35] This is the stuff I’m excited about. People, often, outside of our profession don’t really understand how much is driven by data science, and really is driven by us, data scientists. And I love that the logo for the data science meetup is a dude pulling open his shirt with a Superman logo that says “DS” on it, because without hyperbole it really seems like we have these superpowers, that we have the ability to look into massive amounts of hairballs of data that people can’t handle; we can see patterns; there is really some amazing stuff we have. We’ve just come to a time where it’s a moment for us to all step up. Drew Conway actually said to me, who I think spoke here a couple times ago — we asked him what he thought of thinking about data science, and he said, “you know, data scientists are Green Lantern.” We all have our rings; they’re all a little different, but they all have these creative powers that make wonderful things, and that is an exciting thing about our profession.
[5:04] I think probably where I saw this most when I started doing this work, going to my first hackathon, I was like, “wait a second, I’m sitting next to the best machine learning dude in town.” I’m seeing fantastic code, seeing people from Berkeley, mathematicians from Stanford — we can create stuff outside of our jobs. We don’t need governments to tell us what’s cool; we don’t need our job to dictate what we can do; we can build whatever we want. We’re going to build stuff that’s super powerful; we’re going to build stuff that’s going to change the world; we’re going to build stuff that has so much impact, and then the stuff we built was so UNFULFILLING! What? An app to park your car, an app to pay your bills. I mean, sure, these are great. I have every one of them and use them every day, but it just feels like if we have all that technology, the stuff that’s driving the Hubble telescope, we should be using it for more than just making sure you don’t get accidentally recommended “Notting Hill” when you really wanted to see “500 Days of Summer.” So that’s a bummer, but the really cool thing is that the “Big” in Big Data — to me, that is — forget the 4 V’s — Big Data means expansive, and the really cool thing that’s happening right now is that every field is having its data moment.
[6:18] There’s this group called Kilimo Salama. They give insurance to subsistence farmers in Africa via cell phone. It’s a cell phone program; they can opt into it, and they’re really interested in being able to predict who’s going to default on a loan, what the weather is going to do to change insurance prices based on what’s coming in. And they’re just a little non-profit, but all of a sudden they’re a data company. They’ve got data pouring off of this cell phone platform they’ve set up, one of the most ubiquitous platforms in developing nations. So, all of a sudden they could use those same techniques that we use every day to recommend movies, to recommend products, and they could have that same insight behind those same interesting patterns, and they have nobody to help them do that because they can’t afford a data scientist. I know what you guys get paid, and even if they could, they wouldn’t necessarily know what to do with one because this is totally new. They haven’t been built up to be a data science company. So, that was why we founded DataKind: to fill this gap, to get data scientists who were already spending their time working on projects on the side to plug those projects in with non-profits, social organizations, governments, others who suddenly found themselves awash in data and could use these skills for good but didn’t know where else to turn.
[7:27] In that time, we’ve been finding ways to basically cater to any kind of ability that’s out there. If you just have a couple of hours that you can give on a weeknight or a whole weekend event, or even sign up for our longer-term projects with a thing we call the DataCorps for nine months part-time, we want to make it easy for people to plug in and work together. And that way, the hope is that we give data scientists this chance to have social impact; we give social organizations a chance to maximize their impact, and in the process we all get to live in a better world. So, let me show you a few examples before we jump into the cooler things we find from this.
[7:59] The first comes from a group called DC Action For Kids. Their job is to look out for child wellbeing in DC, so every year they look at the state of children and how they’re doing, and they put out a report. This year, they said let’s do something different; the government has opened up huge amounts of data, tons of files and databases about child wellbeing and things related to parents’ income levels or food deserts. So, they said could we get that together in a way that could help other people and policy makers understand what’s going on with child wellbeing. And they thought, we could! It’s all available. It’s great, but then they remember, data looks like this — which is fine for us, but which is totally perplexing if you are a small non-profit. This is a group of four people without data science expertise — and I’m just showing a snippet from hundreds of .CSVs, .PDFs that they had, stuff that they just wouldn’t know how to wrangle. So instead we teamed them up with these guys; this is Max Richman who is at GeoPol now, a couple of people from the Washington Post, and they took all these spreadsheets and they layered them together to make this interactive visualization of child wellbeing on a map. This was their 2012 eDataBook. The basic idea was that no matter what level of ability you had, you could understand this tool. So, if you were an executive looking for an executive summary, you could read their findings that in fact Southeast DC was doing worse than the media was saying it was doing. If you wanted to poke around, wanted to change the layers on the map and actually read into — or, if you wanted to actually analyze the data yourself — they had taken all these disparate data sources and normalized them into a single database at the neighborhood level so you could analyze it yourself.
[9:32] When I show this, there’s almost a moment of embarrassment, in a way — embarrassment is too strong, but you know, I imagine people are sitting out there thinking, interesting to put it on a map. It’s not that big a deal, but what’s crazy is we saw them present this to their constituents, other people doing child wellbeing and tracking, in front of the Mayor of DC, and there was an audible gasp in the auditorium because up until now, everything had come out as a .PDF, and now all of a sudden they could interact with the data. People were coming up, eyes full of tears like “I never knew it was possible!” So, this is amazing: a very small non-profit that never would otherwise have this ability could actually start working with tools like this.
[10:14] Another example comes right here from New York City with the Parks Department. You may not realize, but the New York City Parks Department has data about every single tree in the urban forest, which is what they call the Trees of New York. Don’t laugh at that, it’s beautiful! … But it’s also kind of funny. They track every single tree; they track when it gets planted, what type it is; they track any maintenance, but mostly they just use that data to report on it. And so they wanted to know, could we analyze this data and learn about how the city is functioning? In a particular way, they were looking at this one program where they prune tree limbs. So, tree limbs are hanging over the street and they go, “Oh, we’ve got to cut that down,” so that it doesn’t fall on a car or a grandparent or something. We’ll cut it down to make things safe. Problem is, they have no evidence that that’s actually making anything better — they just do it by gut. So, they said could we use this data to understand if we cut down the trees, would there be a difference?
[11:18] I teamed them up with a couple of people who were at this event that we ran, led by Brian d’Alessandro — if you don’t know him, he actually works at an online ad company, and his job every day is to build these causal models of if he shows you an ad, do you do something different or not than if he didn’t show you the ad? Right, basic experiment, but actually lends itself perfectly to this question — if we trim the tree limbs, were there statistically significantly fewer tree emergencies on those blocks than on similar blocks, or no? And it turned out that not only did they build this wonderful tool that lets New York City Parks actually interact with their data and see where all the trees are — they’ll just kind of plant them and forget about them — but they actually came up with a number, 22%. 22% fewer tree emergencies on those blocks where they pruned trees than on the similar blocks where they didn’t. And this was great for Parks because they could say “We’re making a difference; we know how to more effectively target our resources.” Cities from around the country started writing New York City saying, “How did you do that? If we had this data, we could do that, too.”
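Editor's note: the comparison Jake describes, pruned blocks against similar unpruned blocks, can be sketched in a few lines. This is only an illustration under stated assumptions, not the team's actual method or data: the block counts are made up, the function names `relative_reduction` and `permutation_p_value` are hypothetical, and a simple permutation test stands in for whatever causal model the volunteers really built.

```python
import random
from statistics import mean


def relative_reduction(treated, control):
    """Fractional drop in the mean count for treated blocks versus control blocks."""
    return 1 - mean(treated) / mean(control)


def permutation_p_value(treated, control, n_iter=10_000, seed=0):
    """One-sided permutation test.

    Estimates how often a random relabeling of blocks produces a gap in
    mean emergencies at least as large as the one actually observed.
    """
    rng = random.Random(seed)
    observed = mean(control) - mean(treated)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        gap = mean(pooled[n_treated:]) - mean(pooled[:n_treated])
        if gap >= observed:
            hits += 1
    return hits / n_iter


# Made-up counts of tree emergencies per block (NOT Parks Department data).
pruned_blocks = [3, 2, 4, 1, 3, 2, 3, 4, 2, 3]
similar_unpruned_blocks = [4, 3, 5, 2, 4, 3, 4, 5, 3, 4]

reduction = relative_reduction(pruned_blocks, similar_unpruned_blocks)
p_value = permutation_p_value(pruned_blocks, similar_unpruned_blocks)
print(f"relative reduction: {reduction:.0%}, one-sided p-value: {p_value:.3f}")
```

The hard part in practice is choosing the "similar blocks" so that the two groups differ only in whether they were pruned; the sketch assumes that matching has already been done.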
[12:19] I’ll close with one last example that comes out of the World Bank. The World Bank’s mission is to eradicate poverty, which is no small task. The way that they do this is by building these things called poverty maps. The basic gist is that a poverty map is a census of the income levels of people within a country or region. Now, if you were a policy expert, why would or wouldn’t you want to use a map like this to inform poverty policy?
[13:48] Yeah, sorry to cut you guys off, but that’s exactly what I was thinking — it’s wildly coarse. You have one number in this part of South Africa. What does it mean if South Africa is 87 poor? Is that really enough? The point is this is just not enough data. It takes them about five years to build one of these because they have to go out on foot, collect that data, and then you come up with one number per country. Are there ways that we can get finer-grained data for this problem? One of the things they really wanted was food prices. Food prices are really hard to get and are really important for setting policy; they can prevent inflation. So I said man, if only there were some way we could get food price data, but we don’t know where to get it. So we teamed up with a couple of data scientists, and I said, well, there’s a couple places we could think of: supermarket websites. This is Pick n Pay, one of the biggest supermarkets in South Africa, and there’s tons of data. They update every day on the products that they’re selling. There’s also a bunch of mobile sites that record mobile transactions of people buying things in markets. So, the team put together a huge thousand-day retrospective of daily food prices, and they did this subnationally so they could make this kind of boring-looking but important graph which showed the price per person per day for these 11 crucial items within each country. And what’s cool about this is the people at the World Bank had never seen this before. There was no way of getting this fidelity of data. A graph that I didn’t show is one in which the team was actually able to see an impending rice crisis; you’d see the price of rice sort of tipping up really quickly that wasn’t being represented in the data that the World Bank had before that.
[15:50] It’s a small thing. It doesn’t sound like all that much, necessarily, but these are huge changes to social organizations that don’t have these skills and just really wouldn’t know where else to go. I think these are really interesting things that we’re seeing. And beyond just looking at the projects, there are tons of other projects for me to talk about. I am going to try to be quick so I can just talk with you guys. The other projects that we could talk about don’t even cover the fact that there are bigger things that we’re seeing. We can talk about the results of this, but the other thing we’re seeing is that groups are scaling these results. We mentioned the New York City Parks department — other cities are starting to glom onto these ideas and grow them into their own areas. We’re also seeing just excitement around the pro bono data science movement. A lot of people show up and say, “I didn’t know I could spend my time making a difference.” So, having these projects actually works as a platform to show people you can contribute. And then the other thing that’s really exciting to me is you start to see an increase in data literacy in the groups that we’re working with, which may seem like a small thing, but right now there is a huge divide between us data scientists and organizations who don’t have data science skills. And to see groups coming back from these projects and saying, “I now know how to ask questions about data science,” or “I know the reason to fund a data scientist,” is huge. So, we’re really excited that this is sort of making a switch in the demand for the market toward data scientists in the social sector.
[17:31] That’s sort of the gist with DataKind, but before I open it to questions, there are a couple of things that are on our minds for the future, in particular around what we’re seeing out there even beyond just DataKind. I want to talk about some of the lessons we’re learning that anyone else who wants to do data science for social good should be totally prepared for.
[17:33] So, first thing we’re seeing is that our job is absolutely to demystify. I think it is such a shame that there is a market demand to keep big data confusing, because as long as it’s confusing, you can sell people things to make it less scary, and that’s fantastic to many unnamed companies in the world. It’s more important for us to remind people that yes, this is a highly technical skill, but it’s not magic. It’s not — we don’t just wave our hands and good stuff comes out, especially because most people show up, especially at DataKind, and say “hey, we would like some magic.” Particularly when they show up with stuff like this (slide at 18:29). People have said that data in the social sector is messy, but it really is messy — like, really. I’m going to say it again because so many times we have volunteers come in, and we say look, there’s a group of organizations who have not had a data infrastructure; they don’t necessarily have data science experts on staff, and it’s going to be bad. And they go, cool, got it — and then two weeks later after we give them the data, they’re like “Oh, my God. It’s so bad. There are so many formatting errors! This doesn’t do the thing — the keys don’t actually align — what happened!?” And I think it’s really important to note that if you’re going to go off and do data for groups that aren’t tech companies themselves, that whole infrastructure isn’t going to be there, and if 90% of data science work is cleaning the data, then 99% of the work in the social sector is cleaning the data, so I would just be prepared for that. It’s one of the most fun parts, to me, but not something you want to be caught unawares of.
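Editor's note: the complaint that "the keys don't actually align" is the kind of problem a quick check can surface before any analysis starts. The sketch below is an illustration only. The rows are invented, and `normalize_key` and `unmatched_keys` are hypothetical helper names, not part of any DataKind tooling.

```python
def normalize_key(raw):
    """Collapse case and whitespace differences so 'Ward 7 ' and 'WARD 7' match."""
    return " ".join(str(raw).split()).lower()


def unmatched_keys(left_rows, right_rows, key):
    """Return the normalized keys that appear in only one of the two tables."""
    left = {normalize_key(row[key]) for row in left_rows}
    right = {normalize_key(row[key]) for row in right_rows}
    return sorted(left - right), sorted(right - left)


# Hypothetical rows standing in for two spreadsheets from the same organization.
programs = [{"site": "Ward 7 ", "enrolled": 40}, {"site": "ward 8", "enrolled": 25}]
budgets = [{"site": "WARD 7", "budget": 1000}, {"site": "Ward  9", "budget": 500}]

only_in_programs, only_in_budgets = unmatched_keys(programs, budgets, "site")
print("only in programs:", only_in_programs)
print("only in budgets:", only_in_budgets)
```

Running a check like this on every join key gives volunteers a concrete list of mismatches to take back to the organization, rather than a silently shrunken merged table.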
[19:26] And then one of the other big things that we’re seeing is this idea of activating genius. What we’ve found is that people just really want to do interesting things. I’m probably preaching to the choir that’s converted here, but we’ve found that DataKind is a lot less about doing data science and a lot more about getting out of people’s way and just making it easy for them to do something that has purpose or a way to use their skills or a chance to master something new. So, that is something really important, something we’ve learned a lot. I don’t know how relevant that is to you guys, but I think if you’re also looking for good data science for good projects, keeping in mind that this is something where you work with volunteers; it’s something that drives them. People just want to give back.
[20:51] And then a general sort of data science adage that always comes up, certainly even before doing DataKind work, what comes up more than the magic thing is this idea of visualization as being a process and not just an end. I’m sure lots of people here throw up charts and graphs as they’re doing work, but I think this is so important to emphasize to really everyone, whether it’s a non-profit or a for-profit. This is a graph showing something that is exciting and beautiful and technical, but it’s not really clear what it means. And not only is there a need for communication design to show people something that they can understand — I think the bigger point here is that visualization isn’t something you do at the end. I know I went to grad school and the idea was that you do all the work; you get the results, and then you put all the pretty graphs in the last section, and that’s what everyone flips to anyway. Now, of course, data has gotten so huge and incomprehensible that it’s important that visualization can be part of that process. Jer Thorp, I think, has really mastered this. He has one of the coolest titles I’ve ever heard; it’s called Data Artist. He built this visualization of the Kepler data (slide at 22:04).
[22:04] For those of you that aren’t familiar, the Kepler satellite went out and looked for exoplanets, those planets that are like Earth that are out in the universe. They brought back the data, they found like 800 exoplanets — places where life could exist — and they put it in a table in the back of a .PDF! It was sad! How can you relate to HR-45 and know that it’s — that doesn’t tell me anything. So, Jer built this interactive visualization that instead actually visualized that data based on what Kepler got back. What you’re seeing here is these are all of the exoplanets, all aligned as though they’re all around a single sun. That’s the star that they would all go around — of course, they’re not all actually around the same one. He’s actually broken them out by heat and by color, but what I love so much about this is he always builds in ways to do transitions. It’s so important that he can look at the data among all these different dimensions. I think this is something that is incumbent on us data scientists because we think about it all the time — how are we not just presenting our results at the end? How are we looking into the data in ways that we can understand new things and find new patterns, because this is a whole new art now that our medium has grown so much?
[23:28] Those are some quick things that we’re finding, and I’m going to bring them up because, again, they’re the four things that people come to us with the biggest questions about, which is what the heck is Big Data, how can you turn messy data into good stuff, why should I visualize things, and why would anyone volunteer? Lastly, the thing I did want to close with that I really wanted you guys to help me talk about is the future. The world is changing super quickly, and if we came back and had this talk three months from now, we’d probably talk about totally different things.
[24:02] Number one, a friend of mine had this phrase that I love: Big Data will soon seem quaint. It seems so charming when we use the term Big Data, because data is pouring in exponentially faster all the time, and there are two examples that I really, really like of this. The first is this guy, Mike Snyder, the Head of Genetics at Stanford. He wants to design a device that takes your blood at every hour and measures about 12,000 different indicators.
[26:00] Who is going to be there to catch [all this data]? Who is going to be there to do something with it? People who design watersheds and research water pollutants? No, us! If you’re wondering what the rest of that blank is, it’s us, and no one else is going to know how to do it. So, data is just going to get bigger and mightier and awesomer. The other thing I wanted to bring up is data for social good comes from these really unexpected places.
[26:51] Who else has seen this project? “43,123 pools I have not visited and never will.” Anyone seen this? Basically, this guy was an artist flying into Los Angeles, looks out his window on the plane and sees all these swimming pools and thinks, “Man, there are a ton of swimming pools in Los Angeles. I wonder how many pools there are.” So, he gets a couple of computer vision experts and they start mining Google Maps, and they build a classifier to identify every pool and look at its dimensions, and they build a list of every pool in Los Angeles. You can actually buy a coffee table book that is “Every Pool in Los Angeles.” Then they went another step further, and they animated a tour through Google Street View, actually visiting each of the pools.
[28:18] I thought this was just a cool project, number one. We’re now at a time with the technology to answer ridiculous questions like that. What I throw out there is someone actually found a use of this for social good. I throw it out to you guys — why would having a map of every swimming pool in Los Angeles ever be used for anything socially positive? Now, they use a similar technique in malaria research to scan for those pools and eradicate them, drastically reducing malaria rates in those areas. Even goofy art projects around the availability of data can actually build some social good projects. I think that’s fantastic.
[30:36] And then the really last one — how many people know what this graphic is referring to? For those who didn’t raise their hands, this is, of course, the infamous story of Target knowing that someone’s daughter was pregnant before the dad did, because of course it was just doing a dumb recommendation based on what she had bought in the past. So, of course when people buy vitamins anticipating a baby, it recommends you diapers because a baby is coming up. But of course, this led to a whole firestorm around the creepiness of Target knowing this data. I think this is really, really, really important because when I hear about Big Data — like, when my mom talks to me about Big Data, the two companies she talks to me about are Target targeting and Facebook running experiments on people. Dangerous! That’s bad for us. We’re a young field, and we can get blown out of the water by being seen as careless, money-obsessed or metric-driven Big Brother-type purveying when, in fact, the work that we’re doing could be so good and so beneficial, and I think it’s incumbent on us to think of how we’re perceived in the world. Especially when we’re doing these data for good projects, the ethics of this — that’s a much longer topic that I’ll just sort of tee up. When you think about the logistics of getting volunteers that work at companies like Google and Facebook to work with the World Bank and create maps that are going to be used to inform poverty policy in Sudan, there are a lot of moving parts. If that data is incorrect, who is responsible?
[32:40] Just to close, I think what ties all of this together for me — like, if you think of The Internet of Things and all this data, and if you think about the ways that we can do all this good with things we couldn’t before and see patterns we couldn’t before and how we talk about this — what we understand, it’s really summed up in this example that a friend of mine pointed out from around 1980 called the macroscope. This guy was looking at the world and he was saying, “you know, we invented the telescope to look at the infinitely great and we invented the microscope so we can see the infinitesimally small in front of us, but what we’re missing is a macroscope to look at the infinitely complex: the patterns in society and nature which heretofore are completely unseen,” and that to me is what data science is capable of doing. We are now able to see those things, and it’s up to us to find the tools and technology to build that macroscope. What we’re doing, I would argue, is that macroscope and how to tell that story back in ways that we can invigorate people and empower them to make society and nature better. So, that’s really the gist of what I’ve come here to say, and I’ve chosen this picture in closing because not only was this at one of our DataKind events, but this is how I feel about doing data science. I would hope that we all would, as well — and particularly not just about the fun things to do and all the technologies, but in the ways that we all gather in doing these things for good, in the recognition that we really are superheroes, ordinary people with extraordinary powers that can use them for good. As John mentioned, for those that are not in New York, DataKind did just expand into five new chapters. So, we now have chapters in D.C., San Francisco, in the UK, Bangalore, Singapore, and Dublin. If you’re visiting any of those areas, stop by.
If you live there, please get involved — but even if you don’t get involved with DataKind or come hang out in New York, aside from that, just get involved in doing good with data. There’s so much we can do. It doesn’t have to come through us, and it’s really incumbent on us to get involved, because that’s the only time that we, together, can use data to make better decisions not about the kind of movies we want to see, but instead decisions about the kind of world we want to see. So, I will stop there. Thank you very much for your time.