The Linux Foundation, the nonprofit advancing professional open source management for mass collaboration, today announced the Community Data License Agreement (CDLA) family of open data agreements. In an era of expansive and often underused data, the CDLA licenses are an effort to define a licensing framework to support collaborative communities built around curating and sharing “open” data.

Inspired by the collaborative software development models of open source software, the CDLA licenses are designed to enable individuals and organizations of all types to share data as easily as they currently share open source software code. Soundly drafted licensing models can help people form communities to assemble, curate and maintain vast amounts of data, measured in petabytes and exabytes, to bring new value to communities of all types, to build new business opportunities and to power new applications that promise to enhance safety and services.

The growth of big data analytics, machine learning and artificial intelligence (AI) technologies has allowed people to extract unprecedented levels of insight from data. Now the challenge is to assemble the critical mass of data for those tools to analyze. The CDLA licenses are designed to help governments, academic institutions, businesses and other organizations open up and share data, with the goal of creating communities that curate and share data openly.

For instance, if automakers, suppliers and civil infrastructure services can share data, they may be able to improve safety, decrease energy consumption and improve predictive maintenance. Self-driving cars are heavily dependent on AI systems for navigation, and need massive volumes of data to function properly. Once on the road, they can generate nearly a gigabyte of data every second. For the average car, that means two petabytes of sensor, audio, video and other data each year.
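The two figures above can be reconciled with a quick back-of-envelope calculation. The sketch below is illustrative only: it takes the data rate as a full gigabyte per second, assumes decimal units (1 petabyte = 1,000,000 gigabytes), and uses a hypothetical helper name, none of which come from the announcement itself.

```python
# Back-of-envelope check of the data volumes quoted above.
# Assumptions (not from the announcement): decimal units, a constant data rate,
# and the rate rounded up to a full gigabyte per second.
GB_PER_SECOND = 1.0      # "nearly a gigabyte of data every second"
PB_PER_YEAR = 2.0        # "two petabytes ... each year"
GB_PER_PB = 1_000_000    # decimal definition of a petabyte


def implied_driving_hours_per_year(pb_per_year: float, gb_per_second: float) -> float:
    """Hours of driving per year needed to produce pb_per_year at gb_per_second."""
    seconds = pb_per_year * GB_PER_PB / gb_per_second
    return seconds / 3600


hours_per_year = implied_driving_hours_per_year(PB_PER_YEAR, GB_PER_SECOND)
hours_per_day = hours_per_year / 365
print(f"{hours_per_year:.0f} hours/year, about {hours_per_day:.1f} hours/day")
# prints "556 hours/year, about 1.5 hours/day"
```

Under these assumptions, two petabytes a year corresponds to roughly an hour and a half of driving per day, which suggests the annual figure reflects typical daily use rather than continuous operation.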

Similarly, climate modeling can integrate measurements captured by government agencies with simulation data from other organizations and then use machine learning systems to look for patterns in the information. It’s estimated that a single model can yield a petabyte of data, a volume that challenges standard computer algorithms, but is useful for machine learning systems. This knowledge may help improve agriculture or aid in studying extreme weather patterns.

And if government agencies share aggregated data on building permits, school enrollment figures, and sewer and water usage, their citizens benefit: commercial entities can anticipate future needs and respond with infrastructure and facilities that arrive ahead of citizens’ demands.

“An open data license is essential for the frictionless sharing of the data that powers both critical technologies and societal benefits,” said Jim Zemlin, Executive Director of The Linux Foundation. “The success of open source software provides a powerful example of what can be accomplished when people come together around a resource and advance it for the common good. The CDLA licenses are a key step in that direction and will encourage the continued growth of applications and infrastructure.”

CDLA Licenses Promote Sharing While Reducing Risk

The Linux Foundation, in collaboration with a broad set of participating organizations, drafted the CDLA licenses to meet the needs of companies, organizations and communities that have valuable data assets such as these to share. The licenses are intended to let contributors and consumers of open datasets use and contribute data in a uniform fashion, while clarifying the terms of that sharing and reducing risk.

There are two CDLA licenses: a Sharing license that encourages contributions of data back to the data community and a Permissive license that puts no additional sharing requirements on recipients or contributors of open data. Both encourage and facilitate the productive use of data. A few commercial and community implications of the licenses include:

  • Data producers can share data with greater clarity about what recipients may do with it. Data producers can also choose between Sharing and Permissive licenses and select the model that best aligns with their interests. In either case, data producers should enjoy the clarity of recognized terms and disclaimers of liabilities and warranties.
  • Data communities can standardize on a license or set of licenses that provide the ability to share data on known, equal terms that balance the needs of data producers and data users. Data communities have a high degree of flexibility to add their own governance and requirements for curating data as a community, particularly around areas such as personally identifiable information.
  • Data users who are looking for datasets to help kick off training an AI system or for any other use will have the ability to find data shared under a known license model with terms that clearly state their rights and responsibilities.

The CDLA is data privacy agnostic and relies on the publishers and curators of the data to create their own governance structure around what data they curate and how. Each producer or curator of data will have to work through various jurisdictional requirements and legal issues.

Broad Support for the CDLA

“Data is the oil of the 21st century,” said Mark Radcliffe, Partner and Global Chair of the FOSS Practice Group at DLA Piper. “Yet, the legal protection for and licensing of data is in its infancy. Many current licenses take a variety of inconsistent (and frequently incomplete) approaches to the use and licensing of data. The CDLA provides a valuable tool for companies and lawyers in managing the use and licensing of data. In the best tradition of the open source community, The Linux Foundation used a collaborative process to get the best possible agreement. I will be using the CDLA for many of my clients.”

“We see the CDLA as being in the forefront of encouraging a shift in how people view data,” said Todd Moore, Vice President, Open Technology at IBM. “Given the growing volume of available datasets, data by itself is no longer the primary source of value. Instead, the ecosystem around the development and mining of trends and insights derived from data holds far more value for society. The CDLA provides the right platform to enable this shift in view and we are excited to be an early adopter.”

“Data is replacing concrete as the foundation of 21st century transportation, and knitting this increasingly complex array of public and private data sources together requires new approaches to data licensing and data governance,” said Kevin Webb, Executive Director, Open Transport Partnership. “The CDLA provides a critical new tool to facilitate collaboration and data sharing between government and private sector innovators.”

“Shared data licensing will do for machine learning and the next phase of information technology evolution what the GNU General Public License and the free software ethos it embodied did for primary software production over the last generation,” said Eben Moglen, Professor of Law at Columbia Law School and founding director of the Software Freedom Law Center. “Clearly expressed, well-designed rules for ‘share alike’ treatment of collaboratively-produced data will enable massive cooperation and help us resist over-concentrated ownership of the resource most crucial to 21st century social and economic development.”

To learn more, go to