IBM Research has created AGENT, a benchmark for evaluating an AI model's core psychological reasoning ability, or common sense, to help users build and test AI models that reason the way humans do.
“We’re making progress toward building AI agents that can infer mental states, predict future actions, and even work with human partners. However, we lack a rigorous benchmark for evaluating an AI model’s core psychological reasoning ability — its common sense,” Abhishek Bhandwaldar, research software engineer at IBM, and Tianmin Shu, a postdoc at MIT, wrote in a blog post.
The researchers used AGENT to challenge two baseline models, evaluating their performance with a generalization-focused protocol developed at IBM. The results show that the benchmark is useful for probing the core psychological reasoning ability of any AI model.
AGENT is a large-scale dataset of 3D animations of an agent moving under various physical constraints and interacting with various objects, according to IBM Research.
“The videos comprise distinct trials, each of which includes one or more ‘familiarization’ videos of an agent’s typical behavior in a certain physical environment, paired with ‘test’ videos of the same agent’s behavior in a new environment, which are labeled as either ‘expected’ or ‘surprising,’ given the behavior of the agent in the corresponding familiarization videos,” Bhandwaldar and Shu wrote.
The trials assess a minimal set of key common-sense concepts considered part of the core psychology of young children, and are grouped into four scenarios: goal preferences, action efficiency, unobserved constraints, and cost-reward trade-offs.
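The trial structure described above — familiarization videos paired with test videos labeled “expected” or “surprising” — can be sketched in code. The following is a minimal, hypothetical representation: the field names, file paths, and the surprise-rating scoring rule are illustrative assumptions, not IBM's actual schema or evaluation protocol.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative only: AGENT's real data format and metric may differ.
SCENARIOS = [
    "goal preferences",
    "action efficiency",
    "unobserved constraints",
    "cost-reward trade-offs",
]

@dataclass
class TestVideo:
    path: str
    label: str  # "expected" or "surprising"

@dataclass
class Trial:
    scenario: str
    familiarization: List[str]  # videos of the agent's typical behavior
    tests: List[TestVideo]      # same agent in a new environment

    def __post_init__(self):
        assert self.scenario in SCENARIOS
        assert all(t.label in ("expected", "surprising") for t in self.tests)

def accuracy(trials: List[Trial], rate_surprise: Callable[[str], float]) -> float:
    """Score a model that assigns a surprise rating to each test video.

    Assumed scoring rule: within a trial, the model is correct when it
    rates the 'surprising' video as more surprising than the 'expected' one.
    """
    correct = total = 0
    for trial in trials:
        expected = [t for t in trial.tests if t.label == "expected"]
        surprising = [t for t in trial.tests if t.label == "surprising"]
        for e in expected:
            for s in surprising:
                total += 1
                if rate_surprise(s.path) > rate_surprise(e.path):
                    correct += 1
    return correct / total if total else 0.0
```

Pairing expected and surprising videos within each trial reflects the benchmark's violation-of-expectation design: the model is judged on relative surprise within a matched context, not on an absolute threshold.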
More details on how the new benchmark works are available here.