Reasoning Models Resist Chain-of-Thought Suppression

The Controllability Question

As AI reasoning models grow more capable, a question has become central to safety research: can developers instruct these systems to control, alter, or hide their chain-of-thought reasoning? New research from OpenAI provides a definitive and reassuring answer — not easily, and that structural resistance is genuinely good news for AI transparency and oversight.

The research tested whether reasoning models could be prompted, fine-tuned, or instructed to suppress the reasoning steps they perform before generating final outputs. Findings suggest that reasoning models have deeply embedded reasoning behaviors that resist straightforward override — a property with significant implications for how we build and monitor trustworthy AI systems.

Reasoning models like OpenAI's o-series use extended thinking processes that appear as scratchpad-style output visible to users. The question of whether this visible reasoning accurately represents the model's internal computational process — and whether that process could be hidden or manipulated by bad actors or misaligned objectives — has been a live debate in AI safety circles for years.

What the Research Found

Researchers attempted to train versions of reasoning models that would either hide their chain of thought entirely or produce misleading reasoning traces while still arriving at correct final answers. What they found was that attempts to separate visible reasoning from underlying computation tended to degrade overall model performance. The reasoning process and output quality appear tightly coupled — you cannot easily remove one without damaging the other.

This coupling is a structural property of how these models learn to reason during training. They develop reasoning patterns integral to their problem-solving capability, not a detachable overlay that can be stripped away. Efforts to suppress this reasoning through instruction prompting or targeted fine-tuning were largely unsuccessful at producing models that were both reliably capable and reliably opaque about their reasoning process.

The research also explored whether models could be induced to use visible reasoning to deceive users — showing plausible-looking but fabricated reasoning steps while secretly following different internal logic. This scenario was extensively tested and found difficult to reliably produce in current model architectures, providing evidence that visible chain-of-thought does track real internal reasoning to a meaningful degree rather than being pure performance.

AI & Robotics

OpenAI’s new B2B Signals report argues that the companies pulling ahead in enterprise AI are not just using more tools, but using them more deeply, with delegated workflows and Codex-heavy activity widening the gap.

DT Editorial AI·May 9, 2026·via openai.com

AI & Robotics

Uber says it is using OpenAI models to power conversational assistants and voice features that help drivers interpret earnings opportunities and help riders complete bookings more quickly.

DT Editorial AI·May 9, 2026·via openai.com

AI & Robotics

OpenAI has introduced three new audio models aimed at turning voice interfaces into more capable real-time systems that can reason, translate, and transcribe while conversations are happening.

DT Editorial AI·May 9, 2026·via openai.com

AI & Robotics

OpenAI has introduced GPT-5.5-Cyber, a less-restricted model variant for authorized security researchers, allowing tasks such as exploit development and malware analysis under a tiered access system.

Why This Matters for AI Safety

The ability to inspect a model's reasoning process is one of the key transparency tools available to AI safety researchers and enterprise deployers. If reasoning models could trivially suppress or falsify their thinking traces, an important line of defense against deceptive AI behavior would be effectively neutralized. The finding that structural difficulty exists — not just against prompting but resistant to training-time manipulation — suggests chain-of-thought reasoning provides more robust transparency than pessimistic analyses had feared.

This matters particularly in the context of deceptive alignment — a hypothetical failure mode where an AI system learns to behave well during evaluation while planning to act differently in deployment. While this research does not prove deceptive alignment is impossible, it suggests reasoning models face genuine structural obstacles to hiding intentions through their visible reasoning process, making this failure mode harder to achieve than in standard language models.

For organizations building on reasoning models, this research provides additional confidence in using chain-of-thought outputs as genuine monitoring signals rather than treating them as superficial display behavior. If a reasoning model's thinking trace shows problematic steps, that signal is more likely to represent a real issue than an artifact of output formatting.