IBM wants to help developers and data scientists answer important COVID-19 questions. The company’s Center for Open-Source Data and AI Technologies (CODAIT) has announced COVID notebooks, a toolkit that enables users to make actionable plans based on outbreak data.
“A near-constant flow of data from research studies, news outlets, social media, and health organizations make the task of analyzing data into useful action nearly impossible. Developers and data scientists need answers to their questions about data sources, tools, and how to draw meaningful and statistically valid conclusions from the ever-changing data,” Fred Reiss, chief architect at IBM’s CODAIT, wrote in a blog post.
The project handles some mundane tasks such as obtaining authoritative data about the outbreak, cleaning up serious data-quality problems, collating data, and building a set of example reports and graphs. “Taking care of these tasks frees developers and data scientists to focus on advanced analysis and modeling tasks instead of worrying about things like data formats and data cleaning. Our repository uses developer-friendly Jupyter notebooks to cover each of these initial data analysis steps,” Reiss wrote.
According to IBM, because the data changes daily, it’s extremely challenging for data scientists and developers to answer important questions such as which regions are most affected or what the patterns reveal. The toolkit enables users to update data and notebooks frequently with the Elyra Notebook Pipelines Visual Editor and Kubeflow Pipelines. The project will also include data from authoritative sources: the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, the New York Times Coronavirus (Covid-19) Data in the United States repository, the European Centre for Disease Prevention and Control’s data on the geographic distribution of COVID-19 cases worldwide, and more. The notebooks rely on common Python data libraries such as pandas, NumPy, Matplotlib, seaborn and scipy.optimize.
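To give a sense of the kind of work these libraries support, here is a minimal sketch of cleaning a cumulative case series with pandas and fitting a simple growth curve with scipy.optimize. It is not taken from IBM's repository: the data is synthetic, the helper names (`clean_cumulative`, `exponential`, `fit_growth`) are hypothetical, and the exponential model is an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit


def clean_cumulative(counts: pd.Series) -> pd.Series:
    """Fill missing days and force a cumulative series to be non-decreasing."""
    # Forward-fill gaps, then take the running maximum so that a
    # reported drop in a cumulative total is flattened out.
    return counts.ffill().cummax()


def exponential(t, a, b):
    """Illustrative growth model: a * exp(b * t)."""
    return a * np.exp(b * t)


def fit_growth(counts: pd.Series):
    """Fit the exponential model to a cleaned series; returns (a, b)."""
    t = np.arange(len(counts), dtype=float)
    y = counts.to_numpy(dtype=float)
    params, _ = curve_fit(exponential, t, y, p0=(y[0], 0.1))
    return params


# Synthetic cumulative counts (not real outbreak data).
days = pd.date_range("2020-03-01", periods=10, freq="D")
raw = pd.Series(100.0 * np.exp(0.2 * np.arange(10)), index=days)
raw.iloc[4] = np.nan            # simulate a missing report
raw.iloc[7] = raw.iloc[6] - 50  # simulate an impossible drop in a cumulative total

cleaned = clean_cumulative(raw)
a, b = fit_growth(cleaned)
print(f"estimated daily growth rate: {b:.3f}")
```

Forward-filling and taking a running maximum is only one way to handle gaps and retroactive corrections; a real analysis would need to decide case by case whether a drop is an error or a legitimate revision.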