
As every company moves to implement AI in some form or another, data is king. Without quality data to train on, the AI likely won’t deliver the results people are looking for, and any investment made in training the model won’t pay off as intended.
“If you’re training your AI model on poor quality data, you’re likely to get bad results,” explained Robert Stanley, senior director of special projects at Melissa.
According to Stanley, there are a number of data quality best practices to stick to when it comes to training data. “You need to have data that is of good quality, which means it’s properly typed, it’s fielded correctly, it’s deduplicated, and it’s rich. It’s accurate, complete and augmented or well-defined with lots of useful metadata, so that there’s context for the AI model to work off of,” he said.
If the training data does not meet those standards, it’s likely that the outputs of the AI model won’t be reliable, Stanley explained. For instance, if data has the wrong fields, then the model might start giving strange and unexpected outputs. “It thinks it’s giving you a noun, but it’s really a verb. Or it thinks it’s giving you a number, but it’s really a string because it’s fielded incorrectly,” he said.
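In practice, checks like these can be run before any training starts. The snippet below is a minimal sketch of that idea, assuming a small pandas DataFrame of hypothetical contact records; the column names and validation rules are illustrative, not a description of Melissa’s actual tooling.

```python
import pandas as pd

# Hypothetical contact records; column names and values are illustrative only.
records = pd.DataFrame({
    "name": ["Ada Lovelace", "Ada Lovelace", "42"],
    "zip_code": ["92688", "92688", "not-a-zip"],
    "phone": ["800-555-0100", "800-555-0100", ""],
})

# Deduplicate exact copies of the same record.
records = records.drop_duplicates()

# Enforce expected formats so a numeric field isn't really a junk string
# and a name field isn't silently holding a number.
valid_zip = records["zip_code"].str.fullmatch(r"\d{5}(-\d{4})?")
valid_name = ~records["name"].str.fullmatch(r"\d+")

# Keep rows that pass; route the rest back for cleanup instead of training.
clean = records[valid_zip & valid_name]
rejected = records[~(valid_zip & valid_name)]

print(f"{len(clean)} clean rows, {len(rejected)} rows need review")
```

Rows that fail the checks are set aside for review rather than fed into training.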
It’s also important to ensure that the data is appropriate to the kind of model you are trying to build, whether that’s business data, contact data, or health care data.
“I would just sort of be going down these data quality steps that would be recommended before you even start your AI project,” he said. Melissa’s “Gold Standard” for any business-critical data is to use data that comes in from at least three different sources and is dynamically updated.
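One way to picture that guideline is to accept a value only when a minimum number of independent feeds agree on it. The sketch below assumes hypothetical source names and a simple exact-match vote; a real pipeline would also normalize values and refresh them dynamically as new data arrives.

```python
from collections import Counter

# Hypothetical feeds reporting the same attribute for one business record.
# Source names and values are made up for illustration.
sources = {
    "postal_feed":  "100 Main St, Springfield, IL 62701",
    "licensing_db": "100 Main St, Springfield, IL 62701",
    "web_crawl":    "100 Main St, Springfield, IL 62701",
    "legacy_crm":   "100 Main Street, Spingfield, IL",  # stale, mistyped entry
}

MIN_AGREEING_SOURCES = 3  # the "at least three sources" threshold

def consensus_value(values_by_source):
    """Return a value only if enough independent sources agree on it."""
    value, votes = Counter(values_by_source.values()).most_common(1)[0]
    return value if votes >= MIN_AGREEING_SOURCES else None

print(consensus_value(sources))  # the agreed address, or None if no consensus
```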
According to Stanley, large language models (LLMs) unfortunately really want to please their users, which sometimes means giving answers that look compelling and correct but are actually wrong.
This is why the data quality process doesn’t stop after training; it’s important to continue testing the model’s outputs to ensure that its responses are what you’d expect to see.
“You can ask questions of the model and then check the answers by comparing them back to the reference data and making sure they’re matching your expectations, like they’re not mixing up names and addresses or anything like that,” Stanley explained.
For instance, Melissa has curated reference datasets spanning geographic, business, identification, and other domains, and its informatics division applies ontological reasoning with formal semantic technologies to compare AI results against expected results based on real-world models.
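A stripped-down version of that kind of output check might look like the following sketch, which is far simpler than the formal semantic approach Stanley describes but illustrates the basic comparison. Here `ask_model` is a placeholder for whatever LLM call is actually in use, and the reference record is a made-up example; the check simply verifies that the expected field values appear in the model’s answer.

```python
# Minimal check of model answers against curated reference data.
# ask_model() is a placeholder for a real LLM call; the record is made up.

REFERENCE = {
    "question": "What is the mailing address on file for Jane Doe?",
    "expected": {
        "name": "Jane Doe",
        "address": "100 Main St, Springfield, IL 62701",
    },
}

def ask_model(question: str) -> str:
    # Placeholder: in practice this would call the deployed model.
    return "Jane Doe's address on file is 100 Main St, Springfield, IL 62701."

def check_answer(answer: str, expected: dict) -> list:
    """Return the expected fields whose values are missing from the answer."""
    return [field for field, value in expected.items() if value not in answer]

missing = check_answer(ask_model(REFERENCE["question"]), REFERENCE["expected"])
print("OK" if not missing else f"Mismatch on fields: {missing}")
```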