Judging when to tighten, or loosen, the local economy has become the world’s most consequential guessing game, and each policymaker has his or her own instincts and benchmarks. The point when hospitals reach 70% capacity is a red flag, for instance; so are upticks in coronavirus case counts and deaths.
But as the governors of states like Florida, California and Texas have learned in recent days, such benchmarks make for a poor alarm system. Once the coronavirus finds an opening in the population, it gains a two-week head start on health officials, circulating and multiplying swiftly before its reemergence becomes apparent at hospitals, testing clinics and elsewhere.
Now, an international team of scientists has developed a model — or, at minimum, the template for a model — that could predict outbreaks about two weeks before they occur, in time to put effective containment measures in place.
In a paper posted on Thursday on arXiv.org, the team, led by Mauricio Santillana and Nicole Kogan of Harvard, presented an algorithm that registered danger 14 days or more before case counts began to increase. The system uses real-time monitoring of Twitter, Google searches and mobility data from smartphones, among other data streams.
The algorithm, the researchers write, could function “as a thermostat, in a cooling or heating system, to guide intermittent activation or relaxation of public health interventions” — that is, a smoother, safer reopening.
“In most infectious-disease modeling, you project different scenarios based on assumptions made up front,” said Santillana, director of the Machine Intelligence Lab at Boston Children’s Hospital and an assistant professor of pediatrics and epidemiology at Harvard. “What we’re doing here is observing, without making assumptions. The difference is that our methods are responsive to immediate changes in behavior and we can incorporate those.”
Outside experts who were shown the new analysis, which has not yet been peer reviewed, said it demonstrated the increasing value of real-time data, like social media, in improving existing models.
The study shows “that alternative, next-gen data sources may provide early signals of rising COVID-19 prevalence,” said Lauren Ancel Meyers, a biologist and statistician at the University of Texas, Austin. “Particularly if confirmed case counts are lagged by delays in seeking treatment and obtaining test results.”
The use of real-time data analysis to gauge disease progression goes back at least to 2008, when engineers at Google began estimating doctor visits for the flu by tracking search trends for words like “feeling exhausted,” “joints aching,” “Tamiflu dosage” and many others.
The Google Flu Trends algorithm, as it is known, performed poorly. For instance, it continually overestimated doctor visits, later evaluations found, because of limitations of the data and the influence of outside factors such as media attention, which can drive up searches that are unrelated to actual illness.
Since then, researchers have made multiple adjustments to this approach, combining Google searches with other kinds of data. Teams at Carnegie Mellon University, University College London and the University of Texas, among others, have models incorporating some real-time data analysis.
“We know that no single data stream is useful in isolation,” said Madhav Marathe, a computer scientist at the University of Virginia. “The contribution of this new paper is that they have a good, wide variety of streams.”
In the new paper, the team analyzed real-time data from four sources, in addition to Google: COVID-related Twitter posts, geotagged for location; doctors’ searches on a physician platform called UpToDate; anonymous mobility data from smartphones; and readings from the Kinsa Smart Thermometer, which uploads to an app. It integrated those data streams with a sophisticated prediction model developed at Northeastern University, based on how people move and interact in communities.
The team tested the predictive value of trends in each data stream by looking at how closely that stream correlated with case counts and deaths over March and April, in each state.
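To make that kind of test concrete, here is a minimal sketch of a lagged correlation check. It is an illustration only, not the researchers' code: the function names, the use of a plain Pearson correlation and the toy numbers are all assumptions made for this example. The idea is to line up a signal on day t with case counts on day t + lag and see which lag gives the strongest match.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def lagged_correlation(signal, cases, lag):
    """Correlate the signal on day t with case counts on day t + lag."""
    if lag == 0:
        return pearson(signal, cases)
    return pearson(signal[:-lag], cases[lag:])


def best_lead(signal, cases, max_lag):
    """Return (lag, correlation) for the lag with the highest correlation."""
    scores = [(lag, lagged_correlation(signal, cases, lag))
              for lag in range(max_lag + 1)]
    return max(scores, key=lambda pair: pair[1])


# Toy data (invented for illustration): cases repeat the signal three days later.
toy_signal = [0, 0, 1, 3, 6, 3, 1, 0, 0, 0, 0, 0]
toy_cases = [0, 0, 0, 0, 0, 1, 3, 6, 3, 1, 0, 0]
lead_days, strength = best_lead(toy_signal, toy_cases, max_lag=5)
```

In this toy case the signal leads the case counts by three days. Real data streams are noisy, so a production analysis would need smoothing and out-of-sample checks that this sketch leaves out.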
In New York, for instance, a sharp uptrend in COVID-related Twitter posts began more than a week before case counts exploded in mid-March; relevant Google searches and Kinsa measures spiked several days beforehand.
The team combined all its data sources, in effect weighting each according to how strongly it was correlated with a coming increase in cases. This “harmonized” algorithm anticipated outbreaks by 14 days, on average, the researchers found.
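A rough sketch of correlation-based weighting might look like the following. The article does not spell out the paper's actual procedure, so everything here is an assumption for illustration: the stream names, the z-score standardization, the clipping of negative correlations to zero and the sample numbers are hypothetical.

```python
from statistics import mean, pstdev


def zscore(series):
    """Standardize a series so streams on different scales can be compared."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def correlation_weights(correlations):
    """Turn per-stream correlations into weights that sum to 1.

    Negative correlations are clipped to 0 (an assumption of this sketch).
    """
    clipped = {name: max(r, 0.0) for name, r in correlations.items()}
    total = sum(clipped.values())
    if total == 0:
        return {name: 1.0 / len(clipped) for name in clipped}
    return {name: r / total for name, r in clipped.items()}


def combine(streams, weights):
    """Weighted sum of standardized streams, one value per day."""
    standardized = {name: zscore(s) for name, s in streams.items()}
    length = len(next(iter(standardized.values())))
    return [sum(weights[name] * standardized[name][t] for name in streams)
            for t in range(length)]


# Hypothetical streams and correlations, invented for illustration.
streams = {
    "twitter": [10, 12, 20, 35],
    "search": [100, 105, 130, 180],
    "thermometer": [5, 5, 5, 5],
}
weights = correlation_weights({"twitter": 0.6, "search": 0.3, "thermometer": -0.2})
combined_signal = combine(streams, weights)
```

A stream that tracks future cases closely gets more say in the combined signal, and one that does not track them at all gets none.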
Looking ahead, it predicts that Nebraska and New Hampshire are likely to see cases increase in the coming weeks if no further measures are taken, despite case counts being currently flat.
“I think we can expect to see at least a week or more of advanced warning, conservatively, taking into account that the epidemic is continually changing,” Santillana said. His co-authors included scientists from the University of Maryland, Baltimore County; Stanford University; and the University of Salzburg, as well as Northeastern.
He added: “And we don’t see this data as replacing traditional surveillance but confirming it. It’s the kind of information that can enable decision-makers to say, ‘Let’s not wait one more week, let’s act now.’”
For all its appeal, big-data analytics cannot anticipate sudden changes in mass behavior any better than other, traditional models can, experts said. There is no algorithm that might have predicted the nationwide protests in the wake of George Floyd’s killing, for instance — mass gatherings that may have seeded new outbreaks, despite precautions taken by protesters.
Social media and search engines also can become less sensitive with time; the more familiar with a pathogen people become, the less they will search with selected keywords.
Public health agencies like the Centers for Disease Control and Prevention, which also consults real-time data from social media and other sources, have not made such algorithms central to their forecasts.
“This is extremely valuable data for us to have,” said Shweta Bansal, a biologist at Georgetown University. “But I wouldn’t want to go into the forecasting business on this; the harm that can be done is quite severe. We need to see such models verified and validated over time.”
Given the persistent and repeating challenges of the coronavirus and the inadequacy of the current public health infrastructure, that seems likely to happen, most experts said. There is an urgent need, and there is no lack of data.
“What we’ve looked at is what we think are the best available data streams,” Santillana said. “We’d be eager to see what Amazon could give us, or Netflix.”