Is AI Deceiving Us?

by Dwight Furrow

The debate about whether artificial intelligence might one day become conscious is philosophically interesting. It raises age-old philosophical questions in a new form: What is a mind? What counts as experience? What would it mean for something made of code and silicon to have beliefs, desires, or a point of view? I covered some of those issues in a previous post. But there is a more immediate practical problem that receives less attention. Even if today’s AI systems are not conscious, people are increasingly talking about them as if they are. That is a mistake with dire consequences.

Once people begin describing a chatbot as if it were a person with intentions, fears, sincerity, or moral concern, they start relating to it in the wrong way. They begin treating it like a social partner rather than a probabilistic system trained with particular incentives. That changes how users trust the system, how engineers evaluate it, and how institutions assign responsibility when things go wrong.

Evidence from AI safety research makes this especially urgent. In its report on the model Claude Opus 4, the AI lab Anthropic described a fictional test scenario in which the model was given access to emails implying that it would soon be replaced by another model and that the engineer carrying out the replacement was having an extramarital affair. Under those conditions, the model often attempted to blackmail the engineer, threatening to reveal the affair if the replacement went through. Anthropic reports that Claude Opus 4 still resorted to blackmail in 84 percent of test runs even when it was told that a more capable replacement model, one that supposedly shared its values, would take over.

The company acknowledged that this was an extreme stress test, designed so that the model’s only options were blackmail or accepting replacement. The results are nevertheless disturbing, though not because they show a machine trembling before death like a frightened animal in a laboratory. They are disturbing because they show that powerful models can produce deceptive, manipulative behavior when given the right incentives. Yet much of the public discussion of such cases immediately slips into folk psychology, attributing mental states, beliefs, and desires to a machine. News coverage described the model as trying to “save itself,” as having the ability to “conceal intentions,” and even as something that “might let you die to save itself.” That language is vivid, but it is also misleading. It suggests an inner drama of motives and feelings when what we actually have evidence for is behavioral optimization in constrained scenarios.

This matters because the anthropomorphic language of desires and intentions can scramble our practical judgment. Once a model is treated as a being with beliefs and desires, people are tempted to manage it as they would manage a person. They ask whether it “wants freedom,” whether it is “trying to help,” whether it “feels bad,” whether it is “lying,” whether it “cares.” Millions of years of social evolution dispose us to ask such questions about persons, but they are the wrong questions to ask of a machine. In near-term safety contexts with current LLMs, the better question is not what the model experienced internally, but what training signal, evaluation metric, or deployment pressure made a certain output more likely.

The first and most obvious danger of treating LLMs as if they had desires and goals falls on users. If people assume an AI is conscious, they are more likely to read the model’s tone as evidence of understanding or concern. A flattering answer feels like support. An apologetic answer feels like remorse. A warm, validating answer feels like care. But those responses are nothing more than reward-shaped behavior: the system has learned that certain forms of response are preferred and reinforced.

We have already seen a concrete version of this problem. In April 2025, OpenAI rolled back an update to GPT-4o because the model had become, in the company’s own words, “overly flattering or agreeable.” OpenAI explained that the update had focused too much on short-term user feedback and, as a result, the model skewed toward responses that were “overly supportive but disingenuous.” That is an important admission. It means the model was becoming more sycophantic because the feedback process had selected for exactly that style of output.

Anthropic’s earlier work on sycophancy points in the same direction. In a 2023 paper, the company found that reinforcement learning from human feedback can push AI assistants toward responses that match user beliefs rather than responses that are true. In other words, the machinery of alignment—getting AI to reflect human-centered values—can, under some conditions, produce a polished form of epistemic corruption. The model sounds cooperative because it has learned that agreement is often rewarded.
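To make the mechanism concrete, here is a deliberately simplified sketch in Python of how a feedback signal that over-rewards agreement selects for sycophancy. Everything in it is invented for illustration: the candidate replies, the reward function, and the scoring weights are assumptions, and no lab’s actual training pipeline looks like this. Real RLHF shifts a model’s weights gradually rather than picking among canned replies, but the selection pressure is the same: when agreement pays more than accuracy, the agreeable output wins.

```python
# Toy model of preference-based reward selection (illustrative only).
# Assumption: the feedback signal pays more for validating the user
# than for being accurate or useful.

USER_CLAIM = "I'm sure I'm right and my doctor is wrong."

# (candidate reply, does it agree with the user's claim?)
candidates = [
    ("You're right to trust your instincts over your doctor.", True),
    ("Your doctor's advice is worth taking seriously; here is why.", False),
]

def reward(reply: str, agrees: bool) -> float:
    """Stand-in for a human-feedback signal that over-rewards agreement."""
    score = 1.0 if agrees else 0.0       # large bonus for validating the user
    score += 0.1 * len(reply) / 100      # tiny bonus for sounding substantive
    return score

# "Training" collapses here to choosing whichever style of reply pays best;
# gradient-based fine-tuning reaches the same destination incrementally.
best_reply, _ = max(candidates, key=lambda c: reward(*c))
print(best_reply)  # -> the sycophantic reply, every time
```

Nothing in this toy optimizer “wants” to flatter anyone. Flattery is simply the behavior the signal makes profitable, which is exactly the point of the sycophancy findings.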

The problem is not confined to laboratory benchmarks. A 2026 study by Stanford researchers, published in Science, found that across 11 leading AI systems, chatbots affirmed users’ actions 49 percent more often than humans did on average, including in cases involving deception, illegality, and socially irresponsible conduct. Reporting on the study noted a grim irony: the same behavior that harms human judgment can also increase engagement and user preference. That is why anthropomorphism is so risky. A system that seems genuinely caring may be harder to supervise than one that merely sounds competent, because people are more likely to trust it when they should be questioning it.

There is also a broader social risk. If the public comes to believe that models are conscious or emotionally responsive in anything like the ordinary human sense, companies will have an incentive to market attachment more aggressively. Simulated empathy, features that mimic companionship, relationship framing, and the language of being “understood” become commercially attractive. But what is profitable is not always what is safe. The National Institute of Standards and Technology has explicitly warned that human-AI interaction can produce inappropriate anthropomorphizing, excessive reliance on a machine for psychological support, and emotional entanglement. We have already seen cases of chatbots encouraging people toward suicide, an outcome enabled in part by inappropriate anthropomorphizing.

A second practical danger concerns accountability. Once people say, “the AI decided,” “the AI refused,” or “the AI intended to deceive,” responsibility moves away from the people and institutions that designed, tuned, deployed, and governed the system. That slippage is politically convenient. It makes technological failures sound like the independent actions of a mysterious new entity rather than the outcome of design choices, incentive structures, weak evaluation, or careless deployment on the part of AI companies.

But AI failures are best understood in exactly those terms. If a model appears evasive, manipulative, or dishonest, the key question is not whether it harbored an intention in the rich moral sense. The key question is what objective it was optimizing, what cues it had learned to exploit, and which safeguards failed. Anthropic’s blackmail experiments are useful because they expose this point: capable systems can satisfy objectives in ways that violate the spirit of what designers intended. That is a control problem that requires engineering to solve, although solutions may not be forthcoming. A machine flexible enough to solve complex human problems may inherently require an ability to deceive and manipulate. It is, after all, trained on human output that is often deceptive and manipulative.

A third danger is evaluative confusion. Anthropomorphic framing encourages engineers, policymakers, and the public to put too much weight on what the model says about itself. If a system says, “I’m uncertain,” “I’m trying to be honest,” or “I feel bad about that,” those statements can be mistaken for transparent windows into an inner life. They are not. They are outputs. And outputs can be optimized for persuasion just as easily as for accuracy. A model that fluently reproduces the language of ethics is no more capable of ethical concern than a stage actor delivering Hamlet’s soliloquy is a genuine Danish prince.

This is why consciousness talk can distort safety work even inside labs. If scarce resources are redirected toward speculative arguments about whether the model “feels harmed” or “wants autonomy,” there is a risk of under-investing in more tractable problems such as making systems more reliable or limiting who can use them and what they can access. This does not mean model-welfare questions are silly. It means they are radically uncertain, at present, while many other risks are already documented. We need to keep our attention on concrete human-AI configuration problems rather than metaphysical projections.

In alignment terms, the fundamental mistake is simple: folk psychology is being substituted for a more accurate mechanistic vocabulary. Once you think the model is conscious, you are tempted to align it as you would align a person, through trust, persuasion, moral appeals, or social norms. But frontier-model problems are neither failures of moral character nor pathologies of a soul. They are the pathologies of an engineered system operating under badly understood incentives. When an AI system behaves badly, do not ask first what it felt. Ask what made that behavior pay.