The dataset, the most extensive machine-readable coronavirus literature collection available, was created with input from researchers and leaders from the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Microsoft, and the National Library of Medicine (NLM) at the National Institutes of Health.
To create the dataset, Microsoft’s web-scale literature curation tools were used to identify and bring together worldwide scientific efforts and results; CZI provided access to pre-publication content; NLM provided access to literature content; and the Allen AI team transformed the content into machine-readable form, making the corpus ready for analysis and study, according to a post from the White House.
Now, researchers are encouraged to submit text and data mining tools and insights via the Kaggle platform, a machine learning and data science community owned by Google Cloud.
“One of the most immediate and impactful applications of AI is in the ability to help scientists, academics, and technologists find the right information in a sea of scientific papers to move research faster. We applaud the OSTP, WHO, NIH and all organizations that are taking a proactive approach to use the most advanced technology in the fight against COVID-19,” said Dr. Oren Etzioni, chief executive officer of the Allen Institute for AI.
Sought-after insights include the natural history, transmission, and diagnostics for the virus, management measures at the human-animal interface, lessons from previous epidemiological studies, and more.
“It’s all hands on deck as we face the COVID-19 pandemic,” said Eric Horvitz, chief scientific officer at Microsoft. “We need to come together as companies, governments, and scientists and work to bring our best technologies to bear across biomedicine, epidemiology, AI, and other sciences. The COVID-19 literature resource and challenge will stimulate efforts that can accelerate the path to solutions on COVID-19.”
The CORD-19 resource is available on the Allen Institute’s SemanticScholar.org website and will continue to be updated as new research is published in archival services and peer-reviewed publications.
The creators of the dataset recommend using metadata from the comprehensive file when available, instead of parsed metadata in the dataset. Please note the dataset may contain multiple entries for individual PMC IDs in cases when supplementary materials are available.
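For readers who plan to work with the metadata, the note about repeated PMC IDs can be handled with a small grouping step. The sketch below is illustrative only: it assumes the metadata is a CSV file with a column named `pmcid`, which is a guess and not a confirmed field name, and it runs on a made-up in-memory sample instead of the real file. The helper names `group_by_pmcid` and `duplicated_pmcids` are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_by_pmcid(rows, id_field="pmcid"):
    """Group metadata rows by PMC ID, skipping rows that have no ID.

    The field name "pmcid" is an assumption, not a confirmed column name.
    """
    groups = defaultdict(list)
    for row in rows:
        pmc_id = (row.get(id_field) or "").strip()
        if pmc_id:
            groups[pmc_id].append(row)
    return dict(groups)


def duplicated_pmcids(rows, id_field="pmcid"):
    """Return the PMC IDs that appear in more than one row, sorted."""
    groups = group_by_pmcid(rows, id_field)
    return sorted(pmc_id for pmc_id, entries in groups.items() if len(entries) > 1)


# Made-up sample standing in for the real metadata file.
SAMPLE_CSV = """pmcid,title
PMC0000001,Main article
PMC0000001,Supplementary material
PMC0000002,Another article
,Entry without a PMC ID
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(duplicated_pmcids(sample_rows))  # prints ['PMC0000001']
```

Grouping first, instead of dropping duplicates outright, keeps the supplementary entries available so an analysis can decide whether to merge them with the main article or set them aside.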