By Benjamin Ross, Editor, AI Trends
The Office of Portfolio Analysis (OPA) at the US National Institutes of Health (NIH) has developed a machine learning model that predicts whether a scientific advance is likely to translate to the clinic. The model, described in a recent study published in PLOS Biology (DOI: https://doi.org/10.1371/journal.pbio.3000416), estimates the likelihood that a research article will be cited by a future clinical trial or guideline, which OPA treats as an early indicator of translational progress.
Developed by OPA Director George Santangelo and colleagues, the model expresses its predictions as a novel metric called "Approximate Potential to Translate" (APT). "We found that distinct knowledge flow trajectories are linked to papers that either succeed or fail to influence clinical research," the study's authors write. "Translational progress in biomedicine can therefore be assessed and predicted… based on information conveyed by the scientific community's early reaction to a paper."
The development of APT values comes as the NIH launches the second version of its iCite tool, a web application that provides a panel of bibliometric information for journal publications within a defined analysis group. The APT values will be freely and publicly available as new components of iCite.
Clinical research is a long, arduous process, the authors say: a discovery can take decades to translate into improvements in human health. That lag makes it difficult to assess and guide progress from bench to bedside.
Santangelo tells AI Trends that machine learning presented an opportunity to get a better read on the likelihood that papers would move into the clinic.
“The work started to develop in terms of seeing if we could find a method that would give us an earlier read on what to expect from different parts of the biomedical research landscape in terms of citation by clinical trials or guidelines as evidence that things were moving into the clinic.”
As the team began to apply their APT values to existing data, Santangelo says nuanced patterns began to emerge as key predictors for translational progress.
“I think the most important one that we focus on is the diversity of interest from across the fundamental to clinical research axis,” he says. “When people across that axis — from fundamental scientists often in the same field as the work that’s being published, all the way to people in the clinic — show an interest in the form of citations in those papers, then the likelihood of eventual citation by a clinical trial or guideline is quite high.”
These indicators have proven to be effective, Santangelo and his colleagues argue, writing that “as little as 2 years [post publication] data yield accurate predictions about a paper’s eventual citation by a clinical article.”
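The published model and its features are not reproduced here, but the core idea Santangelo describes — that a paper cited early by a diverse mix of fundamental and clinical work is more likely to reach the clinic — can be illustrated with a toy score. Everything below (the entropy-based diversity feature, the logistic form, and the weights) is a hypothetical sketch for intuition, not the study's actual model.

```python
import math

def diversity(fractions):
    """Normalized Shannon entropy of citer-type fractions (e.g. the shares of
    citing papers classified as molecular, animal, or human research).
    Returns 0 when all citers are one type, 1 when spread evenly."""
    nonzero = [f for f in fractions if f > 0]
    if len(nonzero) <= 1:
        return 0.0
    h = -sum(f * math.log(f) for f in nonzero)
    return h / math.log(len(fractions))

def toy_translation_score(fractions, clinical_fraction,
                          w_div=2.0, w_clin=3.0, bias=-2.5):
    """Toy logistic score combining citer diversity with the share of
    clinically oriented citers. Weights are illustrative, not fitted."""
    x = w_div * diversity(fractions) + w_clin * clinical_fraction + bias
    return 1.0 / (1.0 + math.exp(-x))

# A paper cited only by same-field molecular work vs. one cited across the axis:
narrow = toy_translation_score([1.0, 0.0, 0.0], clinical_fraction=0.0)
broad = toy_translation_score([0.4, 0.3, 0.3], clinical_fraction=0.3)
```

Under these made-up weights, `broad` scores well above `narrow`, mirroring the pattern Santangelo describes: interest from across the fundamental-to-clinical axis drives the prediction up.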
“We can now get a sense of what’s happening in the literature without as dramatic a censoring effect [a condition in which the value of a measurement is only partially known], which allows us to be more forward looking in understanding what areas of research are more likely to draw interest from clinically-focused scientists,” Santangelo says.
Breaking Down The Paywall
In addition to APT values, the iCite webtool will offer the NIH's Open Citation Collection (NIH-OCC), a free, publicly accessible database for biomedical research. The database currently comprises over 420 million citation links, with more accumulating monthly.
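iCite exposes its data programmatically as well as through the web interface. The sketch below shows one way a researcher might query it for APT values and NIH-OCC citation links; the endpoint path and field names (`apt`, `cited_by`, `references`) reflect our reading of the public iCite API and should be treated as assumptions to verify against the current API documentation. The parsing is demonstrated on a canned response so the example runs offline.

```python
import json
from urllib.parse import urlencode
# urllib.request (or any HTTP client) would fetch icite_url(...); here we
# only build the URL and parse an illustrative response.

ICITE_PUBS = "https://icite.od.nih.gov/api/pubs"

def icite_url(pmids):
    """Build a batch-query URL for a list of PubMed IDs."""
    return ICITE_PUBS + "?" + urlencode({"pmids": ",".join(map(str, pmids))})

def summarize(response_text):
    """Extract APT values and NIH-OCC citation-link counts from an
    iCite-style JSON response (field names are assumptions)."""
    out = []
    for rec in json.loads(response_text).get("data", []):
        out.append({
            "pmid": rec.get("pmid"),
            "apt": rec.get("apt"),  # Approximate Potential to Translate
            "n_cited_by": len(rec.get("cited_by") or []),
            "n_references": len(rec.get("references") or []),
        })
    return out

# Illustrative response shape, not real data:
sample = ('{"data": [{"pmid": 123456, "apt": 0.75,'
          ' "cited_by": [111, 222], "references": [333]}]}')
```

Because both the APT values and the citation links sit behind one open endpoint, the same query yields the prediction and the raw data it was built from.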
Santangelo says the database offers a solution to proprietary, restrictive, and often costly licensing agreements that have been a barrier to collaborative research.
“These days there really isn’t a good justification for keeping [this data] behind a paywall, especially with the issues of data quality,” says Santangelo. “We recognized early on that, if we were publishing something, we were using a proprietary source for the raw data, and that others would not be able to calculate the values without working with us on a subset of the data. That never sat easy with us.”
The NIH-OCC offers researchers the chance to access the raw data. “There’s no better check on data quality than that,” Santangelo says.
In a Community Page article in PLOS Biology (DOI: https://doi.org/10.1371/journal.pbio.3000385), Santangelo and his co-authors say the NIH-OCC dataset has been generated from unrestricted data sources such as MEDLINE, PubMed Central, and CrossRef, as well as "data from a machine learning pipeline that identifies, extracts, resolves, and disambiguates authors in references from full-text articles available on the internet."
Santangelo says there’s a standing invitation for data sharing. “We’re data sponges,” he says. “We’ll take data from wherever we can find it.”
Learn more at Office of Portfolio Analysis.