
Review 2: "Optimizing Predictive Models to Prioritize Viral Discovery in Zoonotic Reservoirs"

Reviewers find this a rigorous and well-supported approach to target future studies of betacoronaviruses in bats, and raise a few questions (about the model and the data used) for further clarification.

Published on Sep 29, 2021
This Pub is a Review of:
Optimizing predictive models to prioritize viral discovery in zoonotic reservoirs

Abstract: Despite global investment in One Health disease surveillance, it remains difficult, and often very costly, to identify and monitor the wildlife reservoirs of novel zoonotic viruses. Statistical models can be used to guide sampling prioritization, but predictions from any given model may be highly uncertain; moreover, systematic model validation is rare, and the drivers of model performance are consequently under-documented. Here, we use bat hosts of betacoronaviruses as a case study for the data-driven process of comparing and validating predictive models of likely reservoir hosts. In the first quarter of 2020, we generated an ensemble of eight statistical models that predict host-virus associations and developed priority sampling recommendations for potential bat reservoirs and potential bridge hosts for SARS-CoV-2. Over more than a year, we tracked the discovery of 40 new bat hosts of betacoronaviruses, validated initial predictions, and dynamically updated our analytic pipeline. We find that ecological trait-based models perform extremely well at predicting these novel hosts, whereas network methods consistently perform roughly as well or worse than expected at random. These findings illustrate the importance of ensembling as a buffer against variation in model quality and highlight the value of including host ecology in predictive models. Our revised models show improved performance and predict over 400 bat species globally that could be undetected hosts of betacoronaviruses. Although 20 species of horseshoe bats (Rhinolophus spp.) are known to be the primary reservoir of SARS-like viruses, we find at least three-fourths of plausible betacoronavirus reservoirs in this bat genus might still be undetected. Our study is the first to demonstrate through systematic validation that machine learning models can help optimize wildlife sampling for undiscovered viruses and illustrates how such approaches are best implemented through a dynamic process of prediction, data collection, validation, and updating.

RR:C19 Evidence Scale rating by reviewer:

  • Strong. The main study claims are very well-justified by the data and analytic methods used. There is little room for doubt that the study produced has very similar results and conclusions as compared with the hypothetical ideal study. The study’s main claims should be considered conclusive and actionable without reservation.



1. RR:C19 Strength of Evidence

Claims are very well supported by the data and methods used. Decision makers should consider the claims in this study actionable with limitations, as described by the authors.

2. Comments

At a global scale, Becker et al.’s ensemble predictions of likely bat hosts of betacoronaviruses appear to be the best available. The authors also provide a means of updating those predictions with new data, along with several caveats about their use. Because individual model predictions varied substantially, they advise against relying too heavily on any one model of host-pathogen associations to prioritize sampling efforts. Instead, their work supports the use of hybrid models, ensembling predictions across models, or both. They offer a new metric of individual model performance to inform decisions about which models to use and/or how to weight model predictions when ensembling. Finally, they demonstrate the value of host ecological traits in predictive models of host-pathogen interactions.

The manuscript confirms previous work in trait-based disease ecology and offers some important lessons about relying too heavily on any one model to prioritize reservoir host surveillance and discovery. The evidence, methods, and arguments support advancement of Covid-19 understanding within society. The work is well positioned within current literature and understanding, and the authors discuss several important limitations of their conclusions and recommendations. Their presentation of recommended actions, in terms of future modelling efforts and the use of model predictions, is clear and well structured. I would recommend this manuscript for publication.

The only thing that I was a bit unsure about was the discrepancy between the correct in-sample predictions of the Network-1 model (20/21, or 95%) and its performance based on the AUC-TPTSC score, which was poor. The conclusion seems to be that its success rate was a random fluke, but the discrepancy also made me wonder about the test data used to calculate the AUC-TPTSC scores. Can the authors confirm that the test data used for the network models was in-sample only (i.e., in-sample for the network models and both in- and out-of-sample for the trait-based models)? And perhaps add this information to the methods? Otherwise, the network models would be penalized in their AUC-TPTSC scores for observations they couldn’t possibly predict.

As a reader interested in trait-based models, I’d also appreciate a few additional details about those models. For example, did all of the trait-based models use the same set of traits? Could the authors describe those traits in the text (e.g., in the aggregate, such as the types of traits included) or in a table (or perhaps reprint the list/description from Han et al. (2016), if this work used the same traits)? In the trait-based models, were there specific types of traits that seemed to be especially informative (e.g., foraging vs. distribution vs. life history)?

Equity in the context of disease surveillance is largely beyond the scope of this paper. However, relative to potential risk (in terms of likely reservoir host hotspots), the discussion does seem to focus, just slightly disproportionately, on North America (e.g., in reference to One Health and the potential for ‘spillback’). The authors might expand this section to include examples from sub-Saharan Africa and/or Southeast Asia, so that the need for more surveillance in what are comparatively low-risk (and presumably already well-sampled) regions isn’t implied as strongly. They might also briefly consider equity in the context of surveillance, for example, by suggesting that their predictions be used alongside global data on the social determinants of health to prioritize surveillance in regions where both spillover and subsequent human-to-human transmission may be especially likely, or where the impact of outbreaks, when they do occur, is disproportionately high. Examples include predicted reservoir host hotspots that also have high levels of poverty/inequality, poor infrastructure, low surveillance capacity, or high population density/mobility, or that are experiencing conflict or political instability.
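
To make the earlier question about the network models' test data concrete, here is a minimal toy sketch of how the choice of test set could change a network model's score. Everything in it is an assumption for illustration: the species names, scores, and labels are invented, out-of-sample species are assumed to receive the lowest possible score, and a plain rank-based AUC stands in for the authors' AUC-TPTSC metric, which is not reproduced here.

```python
# Toy illustration only: species, scores, and labels below are invented.
# Plain rank-based AUC is a stand-in; it is NOT the AUC-TPTSC metric.

def rank_auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (Mann-Whitney formulation of AUC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# (species, in_network, network_model_score, newly_confirmed_host)
# in_network=False means the species was absent from the host-virus network,
# so a network model has no basis for scoring it.
toy_hosts = [
    ("bat_a", True, 0.9, 1),
    ("bat_b", True, 0.7, 1),
    ("bat_c", True, 0.6, 0),
    ("bat_d", True, 0.2, 0),
    ("bat_e", False, None, 1),
    ("bat_f", False, None, 1),
    ("bat_g", False, None, 0),
]

# Option 1: evaluate the network model on in-sample species only.
in_sample = [h for h in toy_hosts if h[1]]
auc_in_sample = rank_auc([h[2] for h in in_sample], [h[3] for h in in_sample])

# Option 2: evaluate on all species.
# Assumption: out-of-sample species get the lowest possible score (0.0).
auc_all = rank_auc(
    [h[2] if h[1] else 0.0 for h in toy_hosts],
    [h[3] for h in toy_hosts],
)

print(f"in-sample only: {auc_in_sample:.3f}")
print(f"all species:    {auc_all:.3f}")
```

In this toy case the network model ranks every in-sample species correctly (AUC of 1.0), but its score falls to about 0.58 once unscoreable species are included, which is the kind of penalty the question to the authors is meant to rule out.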

