The Allen Institute for AI (AI2) today released OLMo, an open large language model designed to offer insight into what goes on inside AI models and to advance the science of language models.
“Open foundation models have been critical in driving a burst of innovation and development around generative AI,” said Yann LeCun, chief AI scientist at Meta, in a statement. “The vibrant community that comes from open source is the fastest and most effective way to build the future of AI.”
The effort was made possible through a collaboration with the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University, along with partners including AMD, CSC-IT Center for Science (Finland), the Paul G. Allen School of Computer Science & Engineering at the University of Washington, and Databricks.
OLMo is being released alongside pre-training data and training code that, the institute said in its announcement, “no open models of this scale offer today.”
Among the development tools included in the framework is the pre-training data, built on AI2’s Dolma set, which features three trillion tokens, along with the code that produces that training data. The framework also includes an evaluation suite for use in model development, complete with more than 500 checkpoints per model, under the Catwalk project umbrella, AI2 announced.
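For readers who want to experiment with the release directly, the sketch below shows one way such open artifacts are typically loaded. It assumes the weights are published on the Hugging Face Hub under an identifier like allenai/OLMo-7B and that they load through the standard transformers AutoModel interfaces; the repository name and loading path are assumptions for illustration, not details from AI2’s announcement.

```python
# Minimal sketch: loading an openly released model from the Hugging Face Hub.
# "allenai/OLMo-7B" is an assumed repository identifier for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # assumed Hub location of the released weights

# trust_remote_code allows custom model code shipped with the checkpoint to run.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Generate a short continuation to confirm the model loads and runs.
prompt = "Language modeling is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```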
“Many language models today are published with limited transparency. Without having access to training data, researchers cannot scientifically understand how a model is working. It’s the equivalent of drug discovery without clinical trials or studying the solar system without a telescope,” said Hanna Hajishirzi, OLMo project lead, a senior director of NLP Research at AI2, and a professor in the UW’s Allen School. “With our new framework, researchers will finally be able to study the science of LLMs, which is critical to building the next generation of safe and trustworthy AI.”
Further, AI2 noted, OLMo gives researchers and developers more precision by offering insight into the training data behind the model, eliminating the need to rely on assumptions about how the model is performing. And by keeping the models and data sets in the open, researchers can learn from and build on previous models and work.
In the coming months, AI2 will continue to iterate on OLMo and will bring different model sizes, modalities, datasets, and capabilities into the OLMo family.
“With OLMo, open actually means ‘open’ and everyone in the AI research community will have access to all aspects of model creation, including training code, evaluation methods, data, and so on,” Noah Smith, OLMo project lead, a senior director of NLP Research at AI2, and a professor in the UW’s Allen School, said in the announcement. “AI was once an open field centered on an active research community, but as models grew, became more expensive, and started turning into commercial products, AI work started to happen behind closed doors. With OLMo we hope to work against this trend and empower the research community to come together to better understand and scientifically engage with language models, leading to more responsible AI technology that benefits everyone.”