Google is trying to make it easier to discover datasets with Dataset Search. The hope is that this project will enable data scientists, data journalists, “data geeks” and others to find the data that they need.
“In today’s world, scientists in many disciplines and a growing number of journalists live and breathe data. There are many thousands of data repositories on the web, providing access to millions of datasets; and local and national governments around the world publish their data as well,” Natasha Noy, a research scientist at Google AI, wrote in a post.
According to the company, Dataset Search is similar to Google Scholar in that datasets can be found no matter where they are hosted, from publishers’ sites to digital libraries to authors’ personal websites.
After collecting this information, Google analyzes where different versions of the same dataset might exist and finds publications that may have discussed it.
In creating Dataset Search, Google also developed guidelines for dataset providers that specify how data should be described so that Google and other search engines can better understand the content of their pages. The guidelines cover information about each dataset, such as who created it, when it was published, how the data was collected, and what the terms are for using it.
The guidelines are based on the schema.org open standard for describing information. “We encourage dataset providers, large and small, to adopt this common standard so that all datasets are part of this robust ecosystem,” Noy wrote.
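As a rough illustration of what such a description might look like, the sketch below builds a schema.org `Dataset` record as JSON-LD in Python. The property names (`creator`, `datePublished`, `measurementTechnique`, `license`) come from the schema.org vocabulary and map onto the items the guidelines mention; the helper function, the dataset, and every value in it are hypothetical, and the full set of fields Google expects is not spelled out here.

```python
import json


def build_dataset_jsonld(name, description, creator, date_published,
                         collection_method, license_url, url):
    """Return a schema.org Dataset description as a plain dict (hypothetical helper)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # Who created the dataset
        "creator": {"@type": "Organization", "name": creator},
        # When it was published (ISO 8601 date)
        "datePublished": date_published,
        # How the data was collected
        "measurementTechnique": collection_method,
        # Terms for using the data
        "license": license_url,
    }


# Every value below is invented for illustration only.
record = build_dataset_jsonld(
    name="Example River Temperature Readings",
    description="Daily water temperature readings from a fictional monitoring station.",
    creator="Example Environmental Agency",
    date_published="2018-09-05",
    collection_method="Automated sensor readings taken once per day",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    url="https://example.org/datasets/river-temperature",
)

# Serialize to JSON-LD text, ready to embed in a page.
markup = json.dumps(record, indent=2)
print(markup)
```

A provider would typically place the resulting JSON inside a `<script type="application/ld+json">` element on the page that describes the dataset, so that crawlers can read it alongside the human-readable content.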
The release of Dataset Search covers most datasets in the environmental and social sciences, along with data from other sources such as governments and news organizations. “As more data repositories use the schema.org standard to describe their datasets, the variety and coverage of datasets that users will find in Dataset Search, will continue to grow,” Noy wrote.
Dataset Search already supports multiple languages, and Google plans to add support for more languages soon.
“This launch is one of a series of initiatives to bring datasets more prominently into our products. We recently made it easier to discover tabular data in Search, which uses this same metadata along with the linked tabular data to provide answers to queries directly in search results. While that initiative focused more on news organizations and data journalists, Dataset search can be useful to a much broader audience, whether you’re looking for scientific data, government data, or data provided by news organizations,” Noy wrote.