How AI helps scientists find reliable coronavirus research

As the world unites in the fight against COVID-19, scientists and researchers across the globe are studying the novel coronavirus and publishing their findings in peer-reviewed journals and on pre-print servers.

Scattered across these research papers might be the pieces of the puzzle that will unlock the cure or vaccine for COVID-19 or new ways to treat patients and prevent the spread of the virus. Unfortunately, no single person can go through tens of thousands of documents, and the thousands more that are being added every week.

This is where the artificial intelligence community enters the scene. Among other efforts to help fight the coronavirus pandemic, AI researchers are busy developing tools that will help medical scientists navigate the fast-growing corpus of literature surrounding coronavirus.


The concerted effort to process COVID-19 papers, which has brought together government agencies, tech giants, universities, and research labs, will be a measure of how useful our state-of-the-art AI algorithms have become.


The CORD-19 dataset

In March, the U.S. government teamed up with tech giants Microsoft and Google to gather research papers about COVID-19. The corpus was compiled into a dataset named COVID-19 Open Research Dataset (CORD-19) by the Allen Institute for AI (AI2) in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine at the National Institutes of Health, in coordination with the White House Office of Science and Technology Policy.

CORD-19 was released in mid-March and made available to AI researchers so they can build machine learning models that help scientists find the information they need.

The initial dataset included over 24,000 research papers from peer-reviewed publications as well as pre-print servers such as bioRxiv and medRxiv. It has since grown to more than 47,000 documents.

CORD-19 is available on AI2’s Semantic Scholar website, a search engine for peer-reviewed research. Machine learning researchers can download the dataset from Semantic Scholar. The corpus has also been integrated into the search engine, where it can be queried directly.

AI2 has also launched the CORD-19 Explorer, a full-text search engine specialized for the COVID-19 research corpus. The Explorer also links to other relevant tools. Some of them have been built on CORD-19, such as a search engine that uses Microsoft Azure’s Cognitive Search. Others are based on different data sources, such as the Elsevier Coronavirus Research Repository. You’ll also find a link to COVID-19 Cognitive City, a social network focused on stopping the spread of coronavirus.

The Kaggle challenge


Semantic Scholar and Google Scholar, which also consolidates relevant research papers, are already powerful tools for searching the corpus of knowledge generated on COVID-19. Semantic Scholar uses transformers, the state of the art in natural language processing (NLP). Google has also added BERT, a transformer-based language model, in a recent update to its search engine.

The community, however, wants to know whether it can push the limits of current AI algorithms and use them to further help scientists in their fight against COVID-19.

Following the release of CORD-19, Kaggle, the Google-owned hub for data science and machine learning competitions, launched the COVID-19 Open Research Dataset Challenge. “We are issuing a call to action to the world’s artificial intelligence experts to develop text and data mining tools that can help the medical community develop answers to high priority scientific questions,” the challenge’s description reads.

To be able to measure progress and success, the challenge has been broken down into a list of 10 tasks that can help better understand new information about COVID-19, patient care, and cure development.

For instance, one task involves non-pharmaceutical interventions (NPIs). The AI that tackles this task should be able to peruse the dataset and find papers that discuss NPIs and their effectiveness, such as how travel bans and school closures are helping flatten the COVID-19 curve. Another task involves gathering the latest findings on COVID-19 risk factors.

Results should include complementary information such as the strength of the evidence found in the studies, which can help in the decision-making process.

“Findings should be focused, concise, extract quotes and numbers out of papers and also provide a link to the underlying source,” Anthony Goldbloom, Kaggle’s CEO, has written in an advisory on the CORD-19 challenge.
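To make that requirement concrete, here is a minimal sketch of what a "focused" finding could look like in code: a verbatim quote, the numbers it contains, and a link to the source. This is only an illustration, not how any challenge entry actually works. The `Finding` record, the `extract_findings` function, the sample text, and the example.org URL are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    quote: str          # sentence copied verbatim from the paper
    numbers: List[str]  # numeric values found in that sentence
    source: str         # link back to the underlying paper


# Integers, decimals, and percentages such as "12", "0.5" or "25%".
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def extract_findings(text: str, source: str, keyword: str) -> List[Finding]:
    """Return the sentences that mention `keyword`, with their numbers."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    findings = []
    for sentence in sentences:
        if keyword.lower() in sentence.lower():
            findings.append(Finding(sentence, NUMBER.findall(sentence), source))
    return findings


# Made-up text for illustration; not taken from any real paper.
text = (
    "School closures were studied in 12 regions. "
    "Transmission fell by 25% after closures. "
    "Other measures were not evaluated."
)
results = extract_findings(text, "https://example.org/paper-1", "closures")
for finding in results:
    print(finding.quote, finding.numbers, finding.source)
```

Keeping the quote verbatim and attaching the source to every record means a scientist can check each claim against the paper rather than trusting the tool.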

As of this writing, there have been more than 730 contributors to the CORD-19 Challenge.

Where does AI technology stand today


The tasks included in the CORD-19 Challenge are very practical tasks, and the results will directly affect our response to the coronavirus pandemic. But one thing to note is that we can’t expect miracles from contemporary artificial intelligence technologies.

Language processing is perhaps the most challenging subfield of AI, and language is among the most complex functions of the human brain, the one thing that sets us apart from other living beings. According to many experts, the problem of language processing will remain unsolved until we create artificial general intelligence, the kind of AI that has human-level abstraction, reasoning, and problem-solving capabilities. And by many accounts, we are at least decades away from general AI.

For the moment, our most advanced NLP models rely on deep learning and artificial neural networks. Neural networks are very efficient statistical models that can find recurring patterns in large sequences of data. Deep learning models like transformers, now used in most advanced language models, can operate on very large corpora of text and answer queries in ways that were beyond the capabilities of previous artificial intelligence algorithms.

However, when it comes to extracting the implied meanings that are often omitted in written and spoken language, even the most sophisticated AI algorithms struggle. We still don’t have AI that can understand and process human language as efficiently as a seven-year-old child.

But the silver lining is that this particular challenge involves a very narrow field of research. As opposed to general natural language understanding, the CORD-19 Challenge has a very specific requirement: Searching for information about one virus and one disease.

While current AI systems fall short at general problem-solving, they’re very good at dealing with narrow domains, often performing even better than humans. In fact, according to Goldbloom, “Some of the most impactful work so far have involved simple methods like string matching and regular expressions.” String matching and regular expressions are not even considered AI today.
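As a rough illustration of how far such simple methods can go, the sketch below ranks a handful of papers by how often a regular expression for NPI-related phrases matches their title and abstract. The toy corpus, the pattern, and the `rank_by_matches` helper are assumptions made up for this example; they are not drawn from CORD-19 or from any challenge submission.

```python
import re

# Hypothetical toy corpus; real CORD-19 records carry many more fields.
papers = [
    {
        "title": "Effect of travel bans on epidemic spread",
        "abstract": "Travel bans delayed spread. School closures were also modeled.",
    },
    {
        "title": "Protein structure of a coronavirus",
        "abstract": "We describe the spike protein.",
    },
    {
        "title": "School closure and social distancing",
        "abstract": "Contacts fell after school closures.",
    },
]

# Phrases for two non-pharmaceutical interventions, singular or plural.
PATTERN = re.compile(
    r"travel (?:ban|restriction)s?|school closures?", re.IGNORECASE
)


def rank_by_matches(papers, pattern):
    """Return (match_count, title) pairs for matching papers, best first."""
    scored = []
    for paper in papers:
        text = paper["title"] + " " + paper["abstract"]
        hits = pattern.findall(text)
        if hits:
            scored.append((len(hits), paper["title"]))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = rank_by_matches(papers, PATTERN)
for count, title in ranked:
    print(count, title)
```

Counting matches is a crude relevance signal, but it is transparent: anyone can read the pattern and see exactly why a paper was returned.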

Another factor that provides hope is the quality of the information. One of the challenges of machine learning is gathering and cleaning the data used in training the models. In this case, the entire community is making a concerted effort, and a lot of manual and automated work is going into making sure that we have a consolidated body of reliable documents for research.

So we probably can’t expect the emergence of an AI system that can read and understand every document like a human scientist would. Past efforts at creating such AI systems have failed, and there hasn’t been any fundamental breakthrough to show hope for a change in this regard.

But what we can expect is the development of very specialized AI-powered search tools that will help our scientists find relevant bits in the growing sea of information published on COVID-19. As long as you know which questions to ask (and the people using these systems certainly do), you’ll be able to obtain high-quality information.

As AI2 CEO Oren Etzioni wrote in Wired last week, “While the jury is still out on AI’s contributions in the coming weeks, it’s clear that the AI community has enlisted to fight Covid-19. It is ironic that the AI which has caused such consternation with facial recognition, deepfakes, and such is now at the front lines of helping scientists confront Covid-19 and future pandemics… Our use of AI to fight Covid-19 reminds us that AI is a tool, not a being, and it’s up to us to employ this tool for the common good.”

This story is republished from TechTalks, the blog that explores how technology is solving problems… and creating new ones.

Story by Ben Dickson

Ben Dickson is the founder of TechTalks. He writes regularly about business, technology and politics.
