A number of published datasets are available on the web for developers and researchers to experiment with and build interesting solutions from. However, just because a dataset is open and available doesn’t mean it will necessarily be useful. To make data more accessible and beneficial to the industry, Google has committed to releasing its data responsibly and is offering insights on how others can do the same.
The company has released more than 50 open datasets for researchers, including YouTube-8M, the HDR+ Burst Photography dataset, and Open Images. “Sharing datasets is increasingly important as more people adopt machine learning through open frameworks like TensorFlow,” Google wrote in a post. “Just because data is open doesn’t mean it will be useful, however.”
To address this, Google cleaned up those datasets and converted them into machine-readable formats to make them useful. “Cleaning a large dataset is no small feat; before opening up our own, we spend hundreds of hours standardizing data and validating quality,” the company wrote.
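The article doesn’t describe Google’s pipeline, but the kind of standardization and validation it refers to typically looks something like the sketch below. The column names, rules, and file paths here are hypothetical and purely illustrative.

```python
# Hypothetical sketch of pre-release standardization and validation.
# Column names, rules, and paths are illustrative, not Google's pipeline.
import pandas as pd

REQUIRED_COLUMNS = {"image_id", "label", "capture_date"}

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    missing = REQUIRED_COLUMNS - set(raw.columns)
    if missing:
        raise ValueError(f"dataset is missing required columns: {missing}")

    df = raw.copy()
    # Standardize formats: normalize labels, parse dates into one format.
    df["label"] = df["label"].str.strip().str.lower()
    df["capture_date"] = pd.to_datetime(df["capture_date"], errors="coerce")

    # Validate quality: drop rows with unparseable dates or duplicate IDs.
    df = df.dropna(subset=["capture_date"])
    df = df.drop_duplicates(subset="image_id")
    return df

if __name__ == "__main__":
    cleaned = clean(pd.read_csv("raw_export.csv"))
    cleaned.to_csv("release_candidate.csv", index=False)
```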
Next, it worked to make data findable and useful with its Dataset Search tool. “It’s not enough to just make good data open, though; it also needs to be findable,” according to the company. Dataset Search helps researchers find data sources that are hosted in different locations. In the few months since the tool launched, the number of unique datasets on the platform has doubled, with new contributions from the National Institutes of Health, the Federal Reserve, the European Data Portal, and the World Bank.
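Dataset Search discovers datasets through the structured metadata published on the pages that host them, using the schema.org standards Google mentions later in its post. The snippet below shows what such a schema.org/Dataset description looks like as JSON-LD; the dataset details are made up for illustration.

```python
# Build a schema.org/Dataset description as JSON-LD so crawlers such as
# Dataset Search can index it. The dataset details are hypothetical.
import json

dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example City Air Quality Readings",  # hypothetical dataset
    "description": "Hourly PM2.5 readings collected from public sensors.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["air quality", "PM2.5", "open data"],
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/air-quality.csv",
    }],
}

# Embed the output inside a <script type="application/ld+json"> tag on the
# dataset's landing page.
print(json.dumps(dataset_metadata, indent=2))
```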
Google also launched Data Commons, a knowledge graph of data sources that lets users treat various datasets of interest, “regardless of source and format, as if they are all in a single local database,” Google explained. The goal of Data Commons is to reduce the amount of time spent analyzing data across multiple sources.
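The following is not the Data Commons API itself, only a conceptual sketch of the idea behind it: once different sources describe entities with a shared identifier, querying across them becomes a single join rather than bespoke matching code per analysis. The identifiers and figures below are invented for illustration.

```python
# Conceptual sketch of joining datasets that share entity identifiers,
# illustrating the "single local database" idea. Values are invented.
import pandas as pd

# Two hypothetical sources, each keyed by the same place identifier.
census = pd.DataFrame({
    "place_id": ["geoId/06", "geoId/48"],
    "population": [39_000_000, 29_000_000],
})
health = pd.DataFrame({
    "place_id": ["geoId/06", "geoId/48"],
    "uninsured_rate": [0.072, 0.183],
})

# With a shared key, combining sources is one merge instead of a pipeline
# of cleaning and record-matching steps.
combined = census.merge(health, on="place_id")
print(combined)
```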
It is also working to balance the benefits of sharing data with the potential trade-offs. For instance, Google said data openness may enable uses that don’t align with its AI principles, or may expose user or proprietary information and lead to privacy breaches. Google has tackled this with approaches such as its published search trends, federated learning, and differential privacy.
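As a textbook illustration of one of the techniques named above, the sketch below shows differential privacy via the Laplace mechanism. It is a generic example, not Google’s implementation; the query, epsilon, and counts are chosen arbitrarily.

```python
# Generic illustration of differential privacy via the Laplace mechanism.
# Not Google's implementation; parameters are arbitrary.
import numpy as np

def private_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Return a noisy count suitable for epsilon-differentially-private release.

    Adding or removing one person changes a count by at most `sensitivity`,
    so Laplace noise with scale sensitivity/epsilon masks any individual's
    contribution.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: publish how many users searched a term without revealing whether
# any particular user did.
print(private_count(true_count=12_345, epsilon=0.5))
```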
“We hope that our efforts will help people access and learn from clean, useful, relevant and privacy-preserving open data from Google to solve the problems that matter to them. We also encourage other organizations to consider how they can contribute—whether by opening their own datasets, facilitating usability by cleaning them before release, using schema.org metadata standards to increase findability, enhancing transparency through data cards or considering trade-offs like user privacy and misuse,” Google wrote.