Coronavirus and Data Science: A Conversation with Anthony Goldbloom
Steven Cherry Hi, this is Steven Cherry for TTI/Vanguard.
In the midst of the coronavirus crisis, in the short term we need protective gear, testing, contact tracing, testing, hospital capacity, and more testing.
In the long term, we need to understand the virus better—how contagious it is, who is most vulnerable, where is it most active, what are its true symptoms, what are the best treatments, does it confer immunity—and, of course, ultimately, we need a vaccine.
There’s hope that a concerted application of data science can answer, or at least support investigations into, many of those questions.
The epicenter of the coronavirus keeps shifting (Wuhan, Seattle, Italy, New York, a naval ship here, a meatpacking plant there, nursing homes everywhere), but for almost a decade now the epicenter of data science hasn’t moved at all: It’s located in a ten-year-old, now-mature, one-time San Francisco startup, now a part of Google, by the name of Kaggle.
Its founder, Anthony Goldbloom, is a former econometrician at the Reserve Bank of Australia and at the Australian Treasury. He’s spoken at TTI/Vanguard twice, and even before that he was a podcast guest on my old show, Techwise Conversations.
So I’m very happy to say, Anthony, welcome back to the podcast.
Anthony Goldbloom Thanks. Thanks for having me back.
Steven Cherry Anthony, when you started Kaggle, it was based around the kooky idea that if an organization had a bunch of data and a question whose answer was buried in that data, it could post the data, the question, and a prize for the best answer, and data scientists would flock to the site and compete to win that prize and the honor of having had the best answer. Is that still how Kaggle works?
Anthony Goldbloom That's still a component of what Kaggle does. As you said, machine learning competitions were our first product, and they're still a very, very popular part of Kaggle. But we've expanded into a couple of other products as well that are adjacent to machine learning competitions. We have a hosted notebook environment where people can come and run their Python or R data science code, and they can also share that code in a reproducible way so that somebody else can take their work, fork it, and extend it. And then we also have a datasets platform where, you know, just like you go to YouTube to share videos with each other, on Kaggle you can share datasets with each other, so you can upload a dataset that is then available to the rest of the community. Those are the three main parts of Kaggle: competitions, datasets, and notebooks.
Steven Cherry So let's talk about what data science can tell us, and maybe can't tell us, about the Coronavirus and Covid-19. In many of these cases, there's actual work being done at Kaggle to learn some of these answers, or in some cases there could be if we had the right data.
Anthony Goldbloom [...] format, which is a very nice format for machines to be able to read. And then they put that dataset of, I think it's now about fifty-one thousand articles, on top of Kaggle, and they challenged our community to answer key questions about Covid-19. These questions range from what is the incubation period of the virus through to what is the odds ratio on the key risk factors. And so what our community has been doing is basically going through these papers and answering these questions. And we have compiled their answers into something that we're calling an AI-powered literature review. You can go to kaggle.com/covid-19-contributions and see this literature review, which goes through question by question and gives you a sense for what the scientific literature says in response to each of those questions, to the extent that there are answers at all.
Steven Cherry So do we seem to be learning more about the pandemic in terms of geography, demographics, and so forth, through these datasets?
Anthony Goldbloom So far this AI-powered literature review has covered a little bit over 5 percent of the literature. Since February 1st, which I think was the first coronavirus death in the U.S., if I'm not mistaken, or maybe case in the U.S. (that was a milestone date for a reason that I've now forgotten), there have been somewhere in the order of 5,000 papers written about Covid-19, and this literature review has gone through and pulled out answers from 5 percent of those papers. So, example things that we have answers to: we have a fairly clear sense for what the academic literature says about the length of the incubation period. We have a very good sense for what the odds ratio is on the different risk factors, so, hypertension, diabetes, COPD [Chronic Obstructive Pulmonary Disease, ed.], having had a previous stroke. We have some answers, although the literature has a little less coverage here, on what is the impact of temperature and humidity on the transmission rate. So, as I said, somewhere in the order of 5 percent of the literature is covered so far. And one of our goals with this AI-powered literature review is to cover more and more of the literature.
Steven Cherry So eventually that would include information about the disease itself, how it differs from other respiratory illnesses, how treatment has to be different .... Are people looking at that yet?
Anthony Goldbloom The questions that are easiest, that fit best into this AI-powered literature review, are questions that are very concrete.
Here's an example of a question that the literature review is doing well on, and an example of a question that it would do less well on. So what is the odds ratio on diabetes? How much of a risk factor is diabetes? How much does it impact the risk of a fatality if you have diabetes as a co-morbidity? We have a very good and clear sense of what the scientific literature says on that. I think to a lot of people it's surprising that diabetes is a risk factor on a respiratory illness. This literature review is not nearly as good at answering a question like, "Why is diabetes a risk factor?" That question is a fair bit more open-ended. I think there's a good chance it will fit into the framework of this AI-powered literature review, but there's also a chance that it might not.
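Editor's note: for readers unfamiliar with the term, an odds ratio compares the odds of an outcome (here, a fatality) in a group with a risk factor to the odds in a group without it. The sketch below is a minimal illustration only. The counts are invented for the example and do not come from the interview or from any study, and the function name is hypothetical.

```python
import math


def odds_ratio(exposed_events, exposed_nonevents, unexposed_events, unexposed_nonevents):
    """Return the odds ratio of a 2x2 table with a Wald 95% confidence interval."""
    a, b, c, d = exposed_events, exposed_nonevents, unexposed_events, unexposed_nonevents
    ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio) under the usual large-sample approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(ratio) - 1.96 * se)
    high = math.exp(math.log(ratio) + 1.96 * se)
    return ratio, low, high


# Invented counts: 30 deaths and 70 survivors with the risk factor,
# 20 deaths and 180 survivors without it.
ratio, low, high = odds_ratio(30, 70, 20, 180)
print(f"odds ratio {ratio:.2f}, 95% CI {low:.2f} to {high:.2f}")  # ratio is 5400 / 1400, about 3.86
```

An odds ratio above 1 with a confidence interval that excludes 1 would suggest the factor is associated with higher risk. It says nothing about why, which is the open-ended kind of question described above.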
Steven Cherry So are there any data sets on Kaggle pertaining to the coronavirus that are not part of this literature review, or does everything sort of end up there only in the context of a scientific paper?
Anthony Goldbloom No. The scientific paper dataset was really the first way that Kaggle got meaningfully involved in the work on Coronavirus/Covid-19. We now have four projects in total that we've had a hand in initiating or that have been driven by Kaggle as a team. And then we have thousands and thousands of datasets that have been uploaded onto Kaggle where people are sharing analysis in a much more freeform, undirected way.
I'll talk a bit about the challenges that we have initiated. First of all, the other one that is looking very promising is a forecasting challenge to forecast both cases and fatalities. We've run four cohorts through this forecasting challenge. The way we do it is we allow people a week to train their models, and then after a week they lock in their model and we score it as more data comes in. In Week 1 the models had some level of accuracy, in Week 2 the models got better, and by Week 3 the models were incredibly strong. I had a quick look at this the other day: if you benchmark them against models like the University of Washington IHME model, the IHME model would have come about 166th out of 250 teams in this competition. Now there's a bunch of reasons why a comparison like that is unfair. People on Kaggle are optimizing for a set loss function; IHME may not have been optimizing for that loss function. But nonetheless, it looks like the forecasts generated out of that forecasting competition are really pretty incredibly accurate. And we're exploring the idea of ensembling a lot of the Kaggle forecasting models and sharing this out as another dashboard that people could look at, to help add another forecast to the mix on what the spread of Covid-19 might be.
We also have a dataset curation challenge where the goal is for our community to be able to support the research community by putting together very rich panels of datasets that can help the research community answer open questions. One really concrete example of a dataset that came out of the dataset curation challenge: somebody in our community has taken every single city or county in the Johns Hopkins University dataset, mapped the location to the nearest weather station, and pulled out weather data on a daily basis for each of those locations. At the moment, if you look at the studies that look at the link between temperature, humidity, and the transmission of Covid-19, the studies will do things like look at 100 cities in China, for instance, in a very widely referenced study. There is now a dataset sitting on the Kaggle website that would allow a researcher to look at every single city in the Johns Hopkins dataset. And so it would allow a researcher to draw a much, much more robust link between temperature and humidity and the transmission of Covid-19. And then finally, we have a challenge with Roche, the pharmaceutical company, around pulling out insights from a range of datasets, really focused on helping frontline workers. That challenge is at an earlier stage.
And so the first three challenges, I'm really proud of. I think all of them have come out with things that are potentially extremely useful. The Roche challenge is in an earlier phase; we'll keep an eye on that. But we're hopeful that some really promising things will come out of that one as well.
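Editor's note: the weather dataset described in this answer rests on matching each location to its nearest weather station. The sketch below shows one simple way such a match could be done, using great-circle (haversine) distance. It is not the community member's actual method, which the interview does not describe. The station IDs and coordinates are placeholders made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(location, stations):
    """Return the station record closest to a (latitude, longitude) pair."""
    lat, lon = location
    return min(stations, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))


# Placeholder stations and approximate city coordinates, for illustration only.
stations = [
    {"id": "STATION_A", "lat": 40.78, "lon": -73.97},
    {"id": "STATION_B", "lat": 47.45, "lon": -122.31},
]
locations = {"New York": (40.71, -74.01), "Seattle": (47.61, -122.33)}

mapping = {name: nearest_station(coords, stations)["id"] for name, coords in locations.items()}
print(mapping)
```

With thousands of locations and stations, a spatial index would be faster than this brute-force scan, but the matching idea is the same. Daily weather readings could then be joined to case counts by station ID and date.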
Steven Cherry And so just to be clear, is the way these challenges work that a certain portion of the data set is held aside and then the model gets tested against that data set to see how well it would have predicted what actually happened? Is that fair to say?
Anthony Goldbloom That's accurate for the forecasting challenge. In the forecasting challenge, people train on all the data up until a certain point and they lock in their model. And then as we get new data coming in, we keep scoring their models on a daily basis, and they're evaluated over a one-month period as the new data comes in. The other challenges that we're running are a little bit atypical for Kaggle; they're more freeform challenges. They're not challenges in the sense that we're trying to find a winner necessarily, although we do, you know, recognize some of the more promising contributions. A lot of AI and machine-learning people want to be helpful right now, and the purpose of these challenges is more to direct people towards the kinds of things that could make useful contributions, that could be used by the healthcare and clinical communities. So I use the word challenge a little more loosely, particularly in the context of the literature review, the data curation, and the Roche challenge.
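Editor's note: the lock-then-score evaluation described here, together with the ensembling idea mentioned earlier, can be sketched in a few lines. The interview does not name the loss function used, so root mean squared logarithmic error (RMSLE) is an assumption made for this example, as is the median ensemble. The team names and all numbers are invented.

```python
import math
from statistics import median


def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two equal-length sequences."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Forecasts of cumulative cases, locked in before the evaluation window opens (invented numbers).
locked_forecasts = {
    "team_a": [110, 125, 140],
    "team_b": [100, 120, 150],
    "team_c": [90, 100, 110],
}

# Counts observed after the lock-in date (invented numbers).
observed = [100, 120, 145]

# Score every locked forecast against what actually happened; lower is better.
scores = {team: rmsle(forecast, observed) for team, forecast in locked_forecasts.items()}
leaderboard = sorted(scores, key=scores.get)

# A simple ensemble: take the median forecast across teams for each day.
ensemble = [median(day) for day in zip(*locked_forecasts.values())]
ensemble_score = rmsle(ensemble, observed)

print(leaderboard, ensemble, round(ensemble_score, 4))
```

In a live competition the observed list would grow day by day and the scores would be recomputed as each new day of data arrives. Because no forecast can be changed after lock-in, the score measures genuine out-of-sample accuracy.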
Steven Cherry So to some extent you have people who are super knowledgeable about the medicine but don't know anything about data science, and then you have data scientists who want to help but don't really know anything in particular about the medicine. How does that get squared away?
Anthony Goldbloom Yeah, particularly on the literature review challenge, it's very, very valuable to have. We have a very, very large team of medical students. This is being put together by Dr. Tayab Waseem, who's an M.D. Ph.D. in Virginia. He has pulled together a very large group of medical students. And they're doing two things. They're helping to define the outputs of the literature review. Once we have a clear sense for what sort of information is useful to extract out of papers, the machine learning algorithms can go and extract that information. But the medical students are on a per-question basis saying, okay, for the incubation period question, "We care about the incubation period in days as well as the range." And so they give that guidance to the machine learning community. The machine learning community then goes and builds algorithms that identify which are the incubation period papers and extract that data. It has been absolutely crucial to have the large team of medical students doing that, and then they're also doing some error-checking. So sometimes the machine learning algorithm will extract the wrong data from a paper. One very common failure, for instance, early on was that the algorithms would often pull out a result that was referring to a result in another paper. I think we've gotten that issue sorted for a lot of the algorithms, but that was a very common failure mode: they weren't distinguishing between a result that came out of this paper versus a result that was referencing another paper. One of the interesting things is, Dr. Tayab has been able to pull together such a large group of medical students because at the moment, for those who are early in their training, it's better that they're not at hospitals, right? They're not far along enough in their training to be able to be in hospitals.
But they're far enough along in their training that they've all done the evidence-based medicine course. And so that's very fresh for them. And that's why they have been absolutely invaluable contributors on the literature review in particular.
I think with the forecasting challenge, we initially ran it in more of an experimental form. Now that it is looking like it is producing very useful results, we're trying to get some feedback and read a lot of the papers written by the epidemiological modeling community so that if we do put together this dashboard, we're using loss functions and putting it together in a way that is consistent with how that community does their modeling. And then the dataset curation challenge is really much more freeform. We don't have a lot of domain experts. And to be honest, I think that's been a drawback of that challenge. As I said, we have this beautiful data set of temperature by county, and currently I'm not aware of anybody writing a paper on it. I think we have this problem where we have a lot of beautifully curated datasets that aren't known about in the research community. And so one of the things that would be good for us to accomplish is a closer connection between the research community and the datasets that the Kaggle community has curated, because I think there's a lot of value there that is currently unrealized.
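Editor's note: the literature-review workflow described at the start of this answer (medical students specify the fields, algorithms extract them, and a common early error was picking up a result cited from another paper) can be illustrated with a toy example. The interview refers to machine learning algorithms; the sketch below swaps in a plain regular expression as a much simpler stand-in, and the sentences, field names, and citation heuristic are all invented for illustration.

```python
import re

# Matches phrases such as "incubation period of 5.1 days (range 2 to 14 days)".
INCUBATION = re.compile(
    r"incubation period\D{0,40}?(?P<days>\d+(?:\.\d+)?)\s*days"
    r"(?:\s*\(range:?\s*(?P<low>\d+(?:\.\d+)?)\s*(?:-|to)\s*(?P<high>\d+(?:\.\d+)?)[^)]*\))?",
    re.IGNORECASE,
)

# Crude heuristic for "this sentence is reporting someone else's result".
CITATION = re.compile(r"\[\d+\]|et al\.")


def extract_incubation(sentence):
    """Return the incubation period fields found in a sentence, or None if absent."""
    match = INCUBATION.search(sentence)
    if match is None:
        return None
    has_range = match.group("low") is not None
    return {
        "days": float(match.group("days")),
        "range": (float(match.group("low")), float(match.group("high"))) if has_range else None,
        "cited_from_elsewhere": bool(CITATION.search(sentence)),
    }


own = extract_incubation("We estimated a median incubation period of 5.1 days (range 2 to 14 days).")
cited = extract_incubation("Li et al. reported an incubation period of 5.2 days [12].")
print(own)
print(cited)
```

The cited_from_elsewhere flag is the kind of signal a human reviewer could use during error-checking to separate a paper's own findings from results it merely references.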
Steven Cherry One big topic right now is antigen testing. Do you know if there is much work being done there? Are there many papers yet about it? Are people asking questions that the dataset could answer?
Anthony Goldbloom Yeah, we have got questions on testing and on vaccination. I believe we have a question in our literature review that covers diagnostics. A lot of the focus, though, is on different approaches one might take to diagnostics. So, as I understand it, it's been PCR and RT-PCR. A lot of the papers are exploring other kinds of tests that might impact the speed at which you can get a test done, or the sensitivity and specificity of the test. The papers that have been compiled are more on the future of testing than, you know, more immediate questions around how we can get more of the existing tests out to people. That's not a question that has been covered.
Steven Cherry And one of the big reasons we care so much about testing, both the regular, whether-you-have-it testing and the antigen testing, is that we're supposing that having exposure to the virus confers some sort of immunity. But I gather we don't really know that yet. And is that one of the questions that people are targeting with the datasets?
Anthony Goldbloom Yeah, immunity is in the mix as a question, and I don't believe it is one that has been addressed yet. And actually, we've just hooked up with a leading medical journal in order to try and focus us a little bit more. The original set of questions that I mentioned came from the White House Office of Science and Technology Policy. They drew on the National Academy of Science and the World Health Organization in order to compile that list of questions. That list of questions was compiled a little bit over a month ago. And we've just started what I think is a really productive relationship with one of the leading medical journals, where they have taken a look at our questions and fed back a list that has helped us prioritize which questions we should go after next. In some ways, the original set of questions is a little bit stale. Some of them are reasonably settled science. Some of them are more pressing to get answers to. And I don't remember whether the immunization question was on the list of highly prioritized questions that came back from the last round of review.
Steven Cherry You know, the data science at some point is only as good as the data itself. And there are certain questions that people have that really could only be answered if the right data exists on top of the right analysis of the data. So, for example, people are asking about herd immunity, not just in general, but very specifically. For example, maybe there's only 13 percent immunity in New York City, but among the medical community, which would be the highest vector for transmitting the disease, maybe the immunity is already 80 percent. To what extent are people finding gaps in the literature that then direct people to try and actually get detailed data like that?
Anthony Goldbloom Yeah, I mean, some of the big gaps in the data that I'm aware of, that people are trying to compile better data sets for, are things like hospitalizations. So, Johns Hopkins University measures the number of cases, the number of recoveries, and the number of fatalities, and they have very wide coverage. I know there are big efforts to try and put together a dataset with equivalent coverage on the number of hospitalizations. That has been an important question for predicting hospital capacity, hospital-demand type questions. That's a dataset that I'm aware there are a few groups trying to put together. On testing, I think the people who have done the best job of that are ... there's a website called Our World in Data, and they've done a really nice job of putting together a testing data set. I feel like there are different places that have become very strong for different types of datasets. I've seen a beautiful dataset (I don't know if it originated on Kaggle or was crossposted to Kaggle) on something that is very important from a modeling perspective: What date did different cities go into lockdown and what was the nature of the lockdown? Was it a school closure? Was it a restaurants-and-bars closing? And so that's another valuable data set that somebody compiled. And again, I forget if it was a Kaggle community member or it was just crossposted to us.
Steven Cherry There's likely going to be an app that will ride on every Apple and Google phone that will do some of the work of contact tracing. That is, it will figure out who's been in contact with whom and store that in a sort of anonymized way. And then if somebody tests positive, there's the opportunity to know and inform everybody that they came into contact with. Do you have any idea if that will be a great addition to the data sets that already exist, once some of that happens?
Anthony Goldbloom Yeah, it's an excellent question. I do not know the answer. I know that Apple and Google have been extremely thoughtful about the privacy considerations around these APIs. And so I have not looked closely enough at what those privacy considerations are and what data might come out of that ... I just don't have a good answer to that.
Steven Cherry There are other countries that have sort of made the balance between privacy and data availability differently from the way the U.S. has been doing it and is likely to continue to do it. So, for example, in South Korea, there's a lot more transparency and less privacy about some of this data. Are those data sets in the Kaggle world? And do they provide some additional information that maybe U.S. data sets wouldn't have?
Anthony Goldbloom Yeah, we have a lot of data sets. And by the way, these are ones not curated by us. I mentioned we have thousands that the community has uploaded, and some of the most popular ones have looked at more detailed data on a by-country basis and sometimes even at the sub-country level. So I know, for instance, there is a very, very popular data set on Covid-19 data in Italy that is at a decently granular level. I believe there is one for South Korea as well. Those datasets became very popular because they were both sort of early places of interest for the virus, and so people were interested in zooming in on the data available in those countries. Actually, maybe a little bit surprising: one that may even be more popular than the Italy dataset is Covid-19 in India. I think there are two reasons for that. First of all, Kaggle has a very large population of data scientists from India. And so, you know, there's some additional interest in Indian datasets. Also potentially, I mean, India has been under the spotlight a little bit because the 1918 Spanish flu hit India a lot harder than other countries. And as I understand it, there are a lot of characteristics in India that might make the spread of this virus a bigger risk there than in other places. So absolutely, we have a lot of much more granular data sets on a by-country basis.
Steven Cherry It seems like the different countries offer an opportunity to really compare one set of conditions against another. I think it's another weird thing about India that there's a vaccine that's widely used there that's hardly used elsewhere that some people have wondered might convey some sort of immunity that wouldn't be as present in other countries.
Anthony Goldbloom Yeah. And another country of interest is Sweden because they've taken a very different approach to lockdown. I think it's been covered a bit in the popular press and I'm fairly certain we have some Swedish-themed datasets as well for that reason.
Steven Cherry My last question gets back to the question of vaccines. It seems like there are a number of places where potentially data science could really help. One would be at the level of DNA itself, in part because we know a lot about the DNA of SARS, and then generally speaking, people have had some success using data science to create models for drug discovery. And then finally, what we talked about a bit already is the testing of possible vaccines.
Anthony Goldbloom Yeah. Certainly an area where Kaggle historically has done quite a lot of work is in chemical informatics, which is your second example. The idea is you take chemical compounds and you try and predict some of their characteristics, like, will they bind to a certain target? Will they have side effects or off-target effects? So that can help at the very beginning of the drug development pipeline to narrow the list of candidates. Kaggle has not been involved in any chemical informatics work related to Coronavirus/Covid-19. And then we have noticed there are quite a few genomic datasets that have been posted on Kaggle. They definitely attract interest, but they tend to attract a fair bit less interest than some of the other datasets. I think the reason is the barrier to entry on those datasets is higher. You need a higher level of scientific understanding in order to make use of those datasets. So it could be that there's good work done on those datasets. But certainly in terms of just raw download numbers, raw numbers of people doing analysis on those datasets on Kaggle, it's definitely somewhat lower.
Steven Cherry So I have to ask, Anthony, just on a slightly more personal note. You're a very busy guy. You know, in the before-coronavirus era, you and your wife just had a second kid, and congratulations on that. What was it like to suddenly get this tsunami of data related to a question of the absolutely highest importance to the entire globe?
Anthony Goldbloom I think as Coronavirus/Covid-19 started heating up, or becoming a bigger and bigger issue, you know, it started touching everybody. Like a lot of teams, we were very interested in ways we could do something useful and really had absolutely no idea how or what. And we were really lucky to get that inbound request from the White House Office of Science and Technology Policy, because they had put together a data set and they had a very well-structured way that we could make a contribution. And that was really, really nice for us. I'm proud of the work that Kaggle has been doing, and I think it's making a good contribution. That initial contact with the White House OSTP gave us, you know, a first foot in the door to be able to go on to do things that, as I said, I really hope end up having some positive impact on the pace at which we understand Covid-19, as well as potentially some of our abilities to forecast it, and so on and so forth. So I'm quite grateful for that. And it's not clear that, if not for the White House reaching out, we would have necessarily found a useful way to make a contribution.
Steven Cherry If the world is to be able to do better at managing this pandemic than previous pandemics, it's going to be because of the blending of science and politics. And so it seems to me that you're right in the center of that and the world should be grateful that Kaggle exists for this purpose. Thanks. Thanks for everything you're doing and for your time today.
Anthony Goldbloom Thanks, Steven.
Steven Cherry We've been speaking with Anthony Goldbloom, founder of Kaggle, the premier site for data science when it comes to Coronavirus and Covid-19.
For TTI/Vanguard, I'm Steven Cherry.
This interview was recorded 23 April 2020.
Audio engineering by Gotham Podcast Studio, New York, N.Y.
Music by Chad Crouch
We welcome your comments @ttivanguard and @techwiseconv
Note: Transcripts are created for the convenience of our readers and listeners. The authoritative record of TTI/Vanguard’s audio programming is the audio version.
Data Science Is Now a Job Market Based Entirely on Merit (IEEE Spectrum, May 17, 2013)