Organizations are getting caught up in the AI and generative AI hype cycle, but in many cases they don’t have the data foundation needed to execute AI projects. A third of executives think that less than 50% of their organization’s data is consumable, underscoring that many organizations aren’t prepared for AI.
For this reason, it’s critical to lay the right groundwork before embarking on an AI initiative. As you assess your readiness, here are the primary considerations:
- Availability: Where is your data?
- Catalog: How will you document and harmonize your data?
- Quality: Is your data good enough to support your AI initiatives?
AI underscores the garbage-in, garbage-out problem: if you feed a model data that’s poor-quality, inaccurate or irrelevant, your output will be, too. These projects are far too involved and expensive, and the stakes are too high, to start off on the wrong data foot.
The importance of data for AI
Data is AI’s stock-in-trade: a model is trained on data and then processes data for a designed purpose. When you’re planning to use AI to help solve a problem – even when using an existing large language model, such as a generative AI tool like ChatGPT – you’ll need to feed it the right context (i.e., good data) to tailor the answers to your business, for example through retrieval-augmented generation (RAG). It’s not simply a matter of dumping data into a model.
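To make that pattern concrete, here is a minimal, illustrative sketch of the RAG idea – retrieve the business context most relevant to a question, then hand it to the model along with the question. It is not any vendor’s API: the keyword-overlap scoring stands in for a real embedding search, and `call_llm` is a stub for a real LLM call.

```python
# Illustrative RAG sketch: retrieve relevant business context, then
# include it in the prompt so the model answers in *your* context.

def score(question: str, document: str) -> int:
    """Crude relevance score: number of words the two texts share
    (a stand-in for embedding similarity)."""
    return len(set(question.lower().split()) & set(document.lower().split()))

def retrieve(question: str, documents: list[str], k: int = 3) -> list[str]:
    """Return the k documents most relevant to the question."""
    return sorted(documents, key=lambda d: score(question, d), reverse=True)[:k]

def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call."""
    return f"(model answer grounded in {len(prompt)} characters of context)"

def answer(question: str, documents: list[str]) -> str:
    context = "\n".join(retrieve(question, documents))
    prompt = f"Using only this company context:\n{context}\n\nAnswer: {question}"
    return call_llm(prompt)

docs = [
    "Refunds are processed within 14 days.",
    "Our warehouse ships orders Monday to Friday.",
    "Support is available 9am-5pm CET.",
]
print(answer("How long do refunds take?", docs))
```

The point of the sketch is the shape of the flow: if the retrieved documents are wrong, duplicated or stale, the model’s answer will be too, no matter how good the model is.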
And if you’re building a new model, you have to know what data you’ll use to train and validate it. That data needs to be split so you can train the model on one dataset, then validate it against a separate dataset to determine whether it’s working.
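A simple, self-contained illustration of that separation, using scikit-learn with a synthetic dataset standing in for your business data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for your business data: 1,000 rows, 20 features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 20% for validation; the model never sees these rows in training.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Scoring on unseen data tells you whether the model actually works,
# rather than whether it memorized its training set.
print(f"validation accuracy: {model.score(X_val, y_val):.2f}")
```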
Challenges to establishing the right data foundation
For many companies, the first big challenge is knowing where their data is and whether it’s available. If you already have some level of understanding of your data – what data exists, what systems it lives in, what the rules for that data are and so on – that’s a good starting point. The fact is, though, that many companies don’t have this level of understanding.
Data isn’t always readily available; it may reside in many systems and silos. Large companies in particular tend to have very complicated data landscapes. They don’t have a single, curated database where everything the model needs is neatly organized in rows and columns, ready to retrieve and use.
Another challenge is that the data is not just in many different systems but in many different formats. There are SQL databases, NoSQL databases, graph databases and data lakes; sometimes data can only be accessed via proprietary application APIs. There’s structured data, and there’s unstructured data. Some data sits in files, and some may be streaming in real time from your factories’ sensors. Depending on what industry you’re in, your data can come from a plethora of different systems and formats. Harmonizing that data is difficult, and most organizations don’t have the tools or systems to do it.
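In practice, harmonization usually means writing an adapter per source system that maps each source’s shape into one canonical record the business agrees on. The sketch below is purely illustrative – the field names, source formats and `CanonicalCustomer` shape are assumptions, not a standard:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CanonicalCustomer:
    """One agreed-upon business shape, regardless of source system."""
    customer_id: str
    name: str
    created_at: datetime

def from_crm(row: dict) -> CanonicalCustomer:
    # Hypothetical CRM export: {"CustID": "17", "FullName": "...", "Created": "2024-01-05"}
    return CanonicalCustomer(
        customer_id=str(row["CustID"]),
        name=row["FullName"].strip(),
        created_at=datetime.strptime(row["Created"], "%Y-%m-%d"),
    )

def from_billing(doc: dict) -> CanonicalCustomer:
    # Hypothetical NoSQL document: {"id": 17, "customer": {"name": "..."}, "ts": 1704412800}
    return CanonicalCustomer(
        customer_id=str(doc["id"]),
        name=doc["customer"]["name"].strip(),
        created_at=datetime.fromtimestamp(doc["ts"]),
    )

# Each source gets its own adapter; everything downstream – quality checks,
# training pipelines, RAG indexes – only ever sees CanonicalCustomer.
```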
Even if you can find your data and put it into one common format – a canonical model the business understands – you still have to think about data quality. Data is messy: it may look fine from a distance, but up close it has errors and duplicates. When you’re pulling from multiple systems, inconsistencies are inevitable. You can’t feed the AI low-quality training data and expect high-quality results.
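As a minimal sketch of what that cleanup involves – assuming, hypothetically, customer records keyed by `customer_id` arriving from two systems – deduplication should surface conflicts rather than silently merge them:

```python
def deduplicate(records: list[dict]) -> list[dict]:
    """Keep one record per customer_id; flag cross-system conflicts."""
    by_id: dict[str, dict] = {}
    conflicts: list[str] = []
    for rec in records:
        existing = by_id.setdefault(rec["customer_id"], rec)
        if existing is not rec and existing["name"].lower() != rec["name"].lower():
            # Same customer, different names in different systems: this needs
            # a survivorship rule or human review, not a silent merge.
            conflicts.append(rec["customer_id"])
    print("conflicting IDs:", conflicts)
    return list(by_id.values())

records = [
    {"customer_id": "17", "name": "Ada Lovelace"},  # from the CRM
    {"customer_id": "17", "name": "A. Lovelace"},   # from billing
    {"customer_id": "42", "name": "Alan Turing"},
]
print(len(deduplicate(records)), "unique customers")  # ID 17 was duplicated
```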
How to lay the right foundation: Three steps to success
The first brick of the AI project’s foundation is understanding your data. You must be able to articulate what data your business is capturing, what systems it lives in, how it’s physically implemented versus the business’s logical definition of it, and what the business rules for it are.
Next, you must be able to evaluate your data. That comes down to asking, “What does good data mean for my business?” You need a definition of what good quality looks like, rules in place for validating and cleansing the data, and a strategy for maintaining its quality over its lifecycle.
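One lightweight way to encode such a definition is as a set of named validation rules that every record must pass. The rules below are illustrative examples of the idea, not a prescribed rule set:

```python
# Each rule is a (name, predicate) pair; a record is "good" only if all pass.
RULES = [
    ("has_id",       lambda r: bool(r.get("customer_id"))),
    ("name_present", lambda r: bool(r.get("name", "").strip())),
    ("valid_email",  lambda r: "@" in r.get("email", "")),   # crude format check
    ("age_in_range", lambda r: 0 < r.get("age", -1) < 130),  # sanity/range check
]

def validate(record: dict) -> list[str]:
    """Return the names of every rule the record fails (empty list = clean)."""
    return [name for name, check in RULES if not check(record)]

bad = validate({"customer_id": "17", "name": "", "email": "x", "age": 200})
print(bad)  # ['name_present', 'valid_email', 'age_in_range']
```

Keeping the rules declarative like this makes “what good looks like” reviewable by the business, not just by engineers, and the same rules can run at ingestion time and throughout the data’s lifecycle.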
Even once you’ve pulled the data from heterogeneous systems into a canonical model and wrangled it to improve its quality, you still have to address scalability. This is the third foundational step. Many models require a lot of data for training; you also need lots of data for retrieval-augmented generation, a technique for enhancing generative AI models with information from external sources that wasn’t part of the model’s training data. And all of this data is continuously changing and evolving.
You need a methodology for creating a data pipeline that scales to handle the load and volume of data you might feed into it. Initially, you’re so bogged down figuring out where to get the data from, how to clean it and so on that you might not have fully thought through how challenging scaling will be once the data is continuously evolving. So consider early what platform you’re building this project on, and make sure that platform can scale up to the volume of data you’ll bring into it.
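As a rough illustration of that pipeline thinking, one common pattern is to process records as a stream of fixed-size batches rather than loading everything into memory, so the same code handles a thousand rows or a continuous feed. The `clean` stage here is a placeholder for the quality rules sketched earlier:

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(records: Iterable[dict], size: int = 1000) -> Iterator[list[dict]]:
    """Yield fixed-size batches so memory use stays flat as volume grows."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch

def clean(record: dict) -> bool:
    """Placeholder for real validation/cleansing rules."""
    return bool(record.get("customer_id"))

def pipeline(source: Iterable[dict]) -> Iterator[dict]:
    # `source` can be a file, a database cursor or a live sensor feed;
    # because everything is a generator, nothing is loaded all at once.
    for batch in batched(source):
        yield from (r for r in batch if clean(r))

# The same pipeline code runs on 10 records or a continuous stream.
sample = ({"customer_id": str(i)} for i in range(10))
print(sum(1 for _ in pipeline(sample)))  # 10
```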
Creating the environment for trustworthy data
When working on an AI project, treating data as an afterthought is a sure recipe for poor business outcomes. Anyone who is serious about building and sustaining a business edge by developing and using AI must start with the data first. The complexity and challenge of cataloging and readying data for business use is a huge concern, especially because time is of the essence. That’s why you don’t have time to do it wrong; a platform and methodology that help you maintain high-quality data are foundational. Understand and evaluate your data, then plan for scalability, and you will be on your way to better business outcomes.