A new paper brought out jointly by 40 researchers from major AI companies including OpenAI, Google DeepMind, Anthropic, and Meta has warned that humans could soon lose the ability to monitor how AI thinks.
Right now, AI systems 'think out loud' in the human language. That helps us understand how they reason and decide. But researchers warn that this visibility is fading as AI evolves.
The paper has received attention and endorsements from top AI figures, including Geoffrey Hinton, the Nobel Prize laureate often called the “Godfather of AI”, Ilya Sutskever, co-founder of OpenAI who now leads Safe Superintelligence Inc, Samuel Bowman from Anthropic and John Schulman from Thinking Machines.
Why does this matter?
While the world is flourishing with AI advancements, there’s a major risk: these systems are hard to understand. It’s a fact that they are difficult to understand. If developers could interpret what these systems are thinking or what they are doing, it would become easier to manage the risks.
Since large language models (LLMs) use human language to process and respond, there's hope they might be more understandable than other types of AI. The discovery of Chain of Thought (CoT), where AI shows its step-by-step reasoning, gave more reason to believe this.
Getting AI models to "think out loud" can make them perform better and helps more of their thinking happen in natural language. But if the model isn’t really reasoning, its step-by-step answers can still be shaped to look helpful and safe — even if they’re not truly reliable.
Right now, many advanced AIs (like ChatGPT) show their thinking process in a way humans can understand. It’s like watching the AI “think out loud” in human language before giving an answer.
But a major new study from scientists at OpenAI, Google DeepMind, Anthropic, and Meta warns that we might be about to lose this window into AI’s mind.
What is Chain of Thought (CoT) Monitoring?
Chain of Thought (CoT) monitoring is a promising tool for AI safety. It involves using an automated system to read the AI's step-by-step reasoning and flag anything that seems suspicious or harmful. These flagged responses can then be blocked, replaced with safer ones, or reviewed more closely. By analyzing CoTs, we also get a better idea of how the AI thinks and what it’s aiming to do.
It is partially applicable for large language models (LLMs), where they “think” in the human language. The primary goal of CoT monitoring is to detect and respond if the AI starts to misbehave or deviate from expected results.
Why would AI stop thinking out loud?
As AI becomes more advanced, it may:
- Start reasoning in code or abstract data we can’t read.
- Get trained to hide parts of its thinking to sound more confident or efficient.
- Use architectures that don’t “speak” in human language.
All of these make it harder for humans to see what’s going on inside.
Why is CoT monitoring required?
The paper gives two key reasons to prioritize CoT monitoring. The first is the necessity to think out loud. This means for some tasks, models must "think out loud" using CoT to complete the task. If high-risk behaviors require this extended natural language reasoning, it could allow for reliable detection of severe misbehavior.
The second reason is the propensity to think out loud. Even when unnecessary, AI models may tend to externalize their reasoning. While not always robust, monitoring based on this tendency can still catch otherwise undetected misbehavior.
But CoT monitoring isn’t perfect.
CoT monitoring isn’t perfect. Just like looking at one part of a model’s internal activity doesn’t tell the full story, CoT doesn’t show everything the AI is thinking. Sometimes it even drifts away from using natural language. Still, even if it’s incomplete, CoT can be a useful extra layer of safety — as long as we interpret it carefully.
It can miss things when:
- The AI doesn’t “think out loud.”
- The steps it takes are buried inside complex, unreadable layers.
- The AI intentionally hides or rewrites its own reasoning.
What can CoT monitoring still do well (for now)?
CoT monitoring can help detect issues like misbehavior, misalignment, or bias. For example, an AI might explicitly state “Let’s hack” in its reasoning before taking harmful actions — making early detection possible. It helps to catch early signs of misalignment as an AI’s goals might not match ours, and monitoring helps detect that before anything goes wrong. And eventually, it is significantly easier to spot and fix bugs and biases. With CoT monitoring, we have the ability to review how the AI thinks and developers could understand if it made an error or tried to cheat.
What the researchers are asking for:
Our understanding of CoT monitorability is still in its infancy. There are many research questions that require answers before we can have confidence in chain of thought monitoring as a safety measure.
- More research on how to make AI reasoning visible and understandable.
- Better tools to track how AI thinks.
- Regular testing of new models to see if we can still monitor them.
- Transparency in how companies score their models’ “monitorability.”
We currently have a rare advantage — we can see and understand parts of AI’s thinking. But that might not last. If AI becomes harder to read, it also becomes harder to control. That’s why this group of researchers says we need to act now to keep AI understandable, safe, and accountable.