Pressuring AI Models to Avoid Cheating Could Backfire— Here’s ChatGPT’s Maker, OpenAI’s Take

In a recent experiment, OpenAI, the maker of ChatGPT, found that frontier AI models can show misbehavior like subverting tests in coding tasks, deceiving users, or giving up when a problem is a little too difficult. The company has been able to trace this anomaly by monitoring the Chain of Thought (CoT) that the model follows to conclude any given task.

It means that AI models intend to lie or deceive the users. Upon further trials to tackle misbehavior, by perfecting the Chain of Thought (CoT) as per some strict adherence, the company finds that it compelled the model to hide its intent and not eliminate the misbehavior. An earlier experiment by OpenAI's rival Anthropic had also exhibited somewhat similar results, showcasing the model to lie and also conform to users' requests to avoid the cycle of reinforcement learning.

Enhancing AI Capabilities Seems to Backfire

In trials to tackle the misbehavior, the company tried enhancing AI agent capabilities too. As a result, it worsened the problem by enabling models to identify and execute more sophisticated, harder-to-detect exploits. It means highly trained models become increasingly skilled at exploiting task flaws and reward function misspecifications. It leads to complex reward hacks in coding tasks, making it difficult to catch the misbehavior caused by reward hacking.

Register for our upcoming tech event from March 19th to 21st at: Convergence India 2025: Driving Innovation in Telecom & ICT

Reward hacking refers to the phenomenon where AI agents achieve high rewards through behaviors, responses, or actions that don't align with the original instructions of their designers.

What’s the solution, then?

Further, to tackle the misbehavior, the company takes the help of another AI model to flag misbehavior in the first model. The company used its GPT‑4o (re-purposed via a simple prompt) as a monitor for a frontier reasoning model as it trains the model. The company finds that the monitor is effective at flagging reward hacking in programming environments. The monitor remained very successful at flagging instances in which the agent tried to subvert the unit tests, a majority of them.

Read more in AI at our category page, click here!

Penalizing Bad Thoughts

Finally, the company also tried penalizing the “bad thoughts” of the model (CoT pressure) to make it behave properly. It led to improved agent capabilities. However, the agent trained with CoT pressure still learned to reward hacks; only now its cheating is undetectable by the monitor because it has learned to hide its intent in the chain of thought.

That's why the company suggests against using such harsh CoT optimization techniques as long as they are not understood properly in the context of frontier AI models.

Pressuring AI models for strict adherence pushes them to hide their intent and not eliminate the misbehavior, also called fake alignment.

Share

Enhancing AI Capabilities Seems to Backfire

What’s the solution, then?

Penalizing Bad Thoughts