OpenAI released GPT-3 in June of 2020, though not to the general public. Those who had access to it were variously impressed or stunned: Is this it? Is AI really coming at long last? Though I was not in the direct access circle, I had indirect access and was deeply impressed, drawing on it for an interview with Hollis Robbins that I published in 3QD on July 20, 2020: An Electric Conversation with Hollis Robbins on the Black Sonnet Tradition, Progress, and AI, with Guest Appearances by Marcus Christian and GPT-3. That August I issued a working paper: GPT-3: Waterloo or Rubicon? Here be Dragons. I stuck this at the top of the first page:
GPT-3 is a significant achievement.
But I fear the community that has created it may, like other communities have done before – machine translation in the mid-1960s, symbolic computing in the mid-1980s – triumphantly walk over the edge of a cliff and find itself standing proudly in mid-air.
This is not necessary and certainly not inevitable.
As far as I can tell, the AI industry is now well out over the edge of that cliff. While I see prospects for a brilliant and productive long-term future, I fear that the next decade or two will be chaotic.
Here’s how ChatGPT put it to me in a recent dialog:
The engineering success is real. LLMs [large language models] and related systems have given us access to a new conceptual continent. They work, and at extraordinary scale. But the institutional failure lies in the monoculture: too much intellectual, financial, and training-path dependence on one family of architectures and one style of thought about intelligence. The result is that we are building out the utility before we have adequately explored the space of possible successor technologies or developed the conceptual tools needed to understand what these systems are revealing about language, cognition, and cultural structure.
The paradox we are facing, then, is that the unexpected and extravagant success of the transformer technology has resulted in a furious maelstrom of entrepreneurial, investment, and commercial activity that may inhibit the further development of that technology.
The story is a complicated one, but I can convey its basic shape with two analogies. The European discovery, exploration, and settlement of North America can give a rough sense of where we are and how far we have to go in the development and use of AI technology. The division of knowledge and skills involved in whaling, seamanship versus whale hunting, can give a rough sense of the institutional failing.

Discovering and Exploring a New World
When Columbus set sail he was expecting to reach the Indies. When he struck land in the Bahamas in 1492 he was a bit south of an entirely different continent, North America. Roughly three centuries later much of the eastern part of the continent had been settled by Europeans and their descendants. Late in 1805 Lewis and Clark reached the Pacific Ocean near the mouth of the Columbia River in Oregon. It took a century after that before European Americans had more or less settled the entire continent.
Let’s say that OpenAI’s release of GPT-3 corresponds to Columbus first setting sail in August of 1492, while the release of ChatGPT in late November of 2022 corresponds to Columbus landing in October of 1492. Within a relatively short time, tens of millions of people had hooked up to ChatGPT. By the end of the year artificial intelligence was no longer little more than the dream of a relatively small group of engineers and scientists. It had moved from the world of science fiction into what is colloquially known as the real world. Just how real it was, and is, became the subject of fierce debate, and remains so. The debate has become widespread and unavoidable.
Within roughly a year and a half, as Claude noted in an article I recently published (“The Shock and the Narrowing”), “capital, talent, and institutional attention all funneled toward a single paradigm,” a computing monoculture based on the kind of technology that powered the GPTs. The idea was to scale it up: more compute (memory and processing), bigger models (more parameters), and more data. Then build out the infrastructure and ship products. Claude continues:
Google, which had invented the transformer architecture in 2017, was caught flat-footed and scrambled. Meta pivoted its AI strategy around LLMs. Microsoft integrated OpenAI’s models into its core products. A hundred startups raised money to build on top of the new foundation models. The venture capital flowing into AI, measured as a share of total U.S. deal value, went from 23% in 2023 to nearly two-thirds in the first half of 2025.
The infrastructure investment that followed is staggering by any historical standard. The four largest hyperscalers — Amazon, Google, Microsoft, and Meta — are expected to spend more than $350 billion on capital expenditures in 2025 alone, most of it AI-related.
To return to our crude analogy, this monoculture believes that all we need to do to reach the Pacific Ocean is to build more of the same. We have all the fundamental ideas we need.
One reason for this is that most of the scientists and engineers currently involved in machine learning and AI are relatively young. For them, the triumph of AlexNet in image recognition in 2012 is the setting-foot-on-a-new-land event. AlexNet’s triumph was achieved through scaling, which was enabled by graphics processing units (GPUs). If that’s your starting point, then it is easy to see why you would believe that scaling up would take LLMs across the continent to the Pacific Ocean, and that it’s going to happen fairly soon.
My time horizon is much longer and encompasses a wider range of technologies. I did research on Old School computational semantics in the 1970s and saw the so-called AI Winters of the 1980s. Moreover, my teacher, David G. Hays, was a first-generation researcher in computational linguistics in the 1950s and saw that field defunded in the mid-1960s for promising more than it delivered.
Thus I have a different view. I’m hardly alone. And some within the industry are beginning to realize that large language models like those powering ChatGPT, Claude, Gemini, Grok, and others may be limited technologies. As Claude put it in that article:
Gary Marcus has been making this argument for years, often dismissed as a contrarian. But the intellectual tide has shifted. Ilya Sutskever — the man who arguably did more than anyone else to build the scaling paradigm that produced GPT-3 and GPT-4 — left OpenAI in May 2024 and founded Safe Superintelligence, a company explicitly premised on the idea that the current path does not lead to genuine intelligence. Fei-Fei Li’s World Labs, which raised $1 billion, is pursuing spatial intelligence and world models — the ability to understand and predict the physical world, which LLMs trained on text cannot do. And Yann LeCun, in a January 2026 interview with MIT Technology Review given from his Paris apartment shortly after leaving Meta, was direct: “The breakthroughs are not going to come from scaling up LLMs. The most exciting work on world models is coming from academia, not the big industrial labs.”
That is to say, LLMs will not take us to the Pacific Ocean, much less support our efforts to settle the continent from coast to coast.
They believe, as do I, that we are going to need some fundamentally new ideas to make it across the continent. A number of ideas are already visible, but are not yet fully developed. The problem is that so much attention is focused on, and so many resources are being devoted to, the transformer-based LLM paradigm that it is not at all clear how, or even whether, these other architectures can be brought to maturity. The problem does not lie entirely with industry. We must also consider the educational “pipeline,” undergraduate and graduate, that funnels people to the industry.
Whaling for Intelligence
The title of a 3QD article I published in December 2023 states the analogy: “Aye Aye, Cap’n! Investing in AI is like buying shares in a whaling voyage captained by a man who knows all about ships and little about whales.” The skills required to navigate and handle a ship are real, and they are necessary to any whaling voyage. But they are different from the skills needed to find, successfully pursue, and ultimately kill whales and extract the oil and ambergris. While one can imagine, as an abstract possibility, that such a voyage might have happened, as far as I know, no real whaling expedition of that nature ever took place. It’s too absurd.
Contemporary AI technology, however, is a very different beast, one where a distinction of that kind can be made without the corresponding deficiency of knowledge (the analog of knowing little about whales) being recognized as such. Contemporary chatbots can generate very sophisticated texts, but you don’t need to know much of anything about language and linguistics, or cognition, much less rhetoric, discourse structure, or literary criticism, in order to create the bots. When a Gary Marcus points to the limitations of large language models, he does so because he has studied language and cognition. He has a deeper understanding of the human mind than do most of the experts who have given us the current AI revolution.
“But, wait,” you might be thinking, “don’t the engineers who design automobiles, for example, don’t they understand how cars work? Do they not grasp the thermodynamics of exploding gas inside a small chamber? Do they not grok how the reciprocating motion of the pistons can be mechanically transformed into rotary motion and how the transmission and drivetrain transfer that motion to the axles? Then there’s the friction between the tires and the road, and between the air and the car’s exterior, not to mention the steering mechanism, the brake mechanism, the windows, the trunk, the air-conditioning, and so forth. Don’t they have to understand these things in order to design and build cars?” Yes, they do. But language technology is different from traditional electro-chemical-mechanical technology, very different.
Just why that is so is a difficult and tricky business, something well beyond the scope of this essay. But I can make some observations. Last month I published “Chess and Language As Paradigmatic Cases for Artificial Intelligence.” My point was simply that they present very different computational problems, but I didn’t say much about the computational techniques used in each case. For roughly the first three to four decades, from the 1950s into the 1990s, the computational techniques were analogous to those involved in automobile engineering, in the sense that the people who designed and wrote the programs did so on the basis of some understanding of how chess and language work. To construct a chess program you need to know not only the rules of chess, but chess strategy and tactics: How do you evaluate a board position? How do you plan a sequence of moves? Similarly, to construct a program to translate a text from one language to another, or to answer questions, you need to base your program on some theory of language and even cognition. If you want to build a program that tells small stories, a standard task for the last quarter of a century, something that David Ferrucci worked on before he went on to lead IBM’s Watson project, you need some theory of story structure, something AI researchers picked up from folklorists and narratologists.
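To make the contrast concrete, here is symbolic game programming in miniature: a minimax search that plays perfect tic-tac-toe. I’ve used tic-tac-toe rather than chess so that the whole thing fits in a few lines, and the sketch is mine, not drawn from any particular program, but the principle is the one the old chess programs relied on: the rules, the scoring, and the look-ahead are all put there explicitly by the programmer. Nothing is learned from data.

```python
# Symbolic game programming in miniature: exhaustive minimax search
# over tic-tac-toe. The point is the method, not the game: the rules,
# the scoring, and the look-ahead are all encoded explicitly by the
# programmer rather than learned from data.

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, move) for `player`; X maximizes, O minimizes."""
    w = winner(board)
    if w is not None:
        return (1 if w == 'X' else -1), None
    moves = [i for i, sq in enumerate(board) if sq == ' ']
    if not moves:
        return 0, None  # board full: a draw
    other = 'O' if player == 'X' else 'X'
    scored = [(minimax(board[:m] + player + board[m + 1:], other)[0], m)
              for m in moves]
    return max(scored) if player == 'X' else min(scored)

score, move = minimax(' ' * 9, 'X')
print(score)  # 0: with perfect play on both sides, the game is a draw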
This older style (which I previously referred to as Old School) is generally known as symbolic computing because it is about computing with symbols, such as language strings, logical propositions, or ordinary arithmetic, with its Arabic-style numerals and limited set of operators (addition, subtraction, multiplication, division, and equivalence). More informally, and with a touch of disdain, it is known as GOFAI (Good Old-Fashioned Artificial Intelligence). The newer style is statistical in character, with artificial neural networks such as large language models being a specific kind of statistical method.
Starting in the 1970s, however, researchers began developing statistical techniques for a range of tasks. One group developed statistical techniques for analyzing strings of spoken language (hidden Markov models), from which they built speech recognition systems. DragonDictate was the first commercially available program; it was released for the DOS operating system in 1990.
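To give a flavor of that statistical turn, here is a toy hidden Markov model decoded with the Viterbi algorithm, the basic machinery behind those early recognizers. Every state, symbol, and probability below is invented for illustration; a real recognizer works on acoustic features, at vastly larger scale.

```python
# A toy hidden Markov model decoded with the Viterbi algorithm, the
# statistical machinery behind early speech recognizers. All states,
# symbols, and probabilities here are invented for illustration.

states = ['vowel', 'consonant']
start = {'vowel': 0.4, 'consonant': 0.6}
trans = {'vowel':     {'vowel': 0.3, 'consonant': 0.7},
         'consonant': {'vowel': 0.6, 'consonant': 0.4}}
emit = {'vowel':     {'a': 0.5, 'e': 0.4, 't': 0.1},
        'consonant': {'a': 0.1, 'e': 0.1, 't': 0.8}}

def viterbi(obs):
    """Return the most probable hidden state sequence for `obs`."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {s: max((best[r][0] * trans[r][s] * emit[s][o],
                        best[r][1] + [s])
                       for r in states)
                for s in states}
    return max(best.values())[1]

print(viterbi(['t', 'a', 't']))  # ['consonant', 'vowel', 'consonant']
```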
Gerard Salton had begun working on so-called vector space semantic models for document retrieval in the 1960s. The basic underlying idea is simply stated: words that occur together in texts do so because their meanings are related. So, you have a program conduct a statistical analysis of a body of texts to determine which words occur together. You then use the results of that analysis for various purposes. Salton was interested in document retrieval, which is more or less what you do when you use an internet search engine, such as Google. But you can use similar techniques to analyze the contents of a text, something that has interested literary critics in the last decade or two. Most importantly for our purposes, vector semantics is the basis of large language models (LLMs). These statistical models are opaque to humans; computers can work with them directly, but we cannot.
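The idea is easy to demonstrate. In the sketch below, which uses a made-up corpus of four little sentences, each word’s vector is simply its count of nearby words, and similarity is measured by the angle between vectors. Real systems do the same thing over billions of words and compress the resulting vectors, but the principle is the same.

```python
# Vector semantics in miniature: words that occur in similar contexts
# get similar co-occurrence vectors. The corpus is invented; real
# systems use billions of words and dimensionality reduction.
from collections import Counter, defaultdict
from math import sqrt

corpus = ("the cat chased the mouse . the dog chased the cat . "
          "the mouse ate the cheese . the dog ate the bone .").split()

WINDOW = 2  # count words up to two positions away on either side
vectors = defaultdict(Counter)
for i, word in enumerate(corpus):
    for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
        if j != i:
            vectors[word][corpus[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = (sqrt(sum(x * x for x in u.values())) *
            sqrt(sum(x * x for x in v.values())))
    return dot / norm

print(cosine(vectors['cat'], vectors['dog']))     # relatively high
print(cosine(vectors['cat'], vectors['cheese']))  # lower
```

Notice that “cat” and “dog” come out more similar than “cat” and “cheese,” purely because they show up in similar company. Nobody told the program anything about animals.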
It turns out that when you take a very sophisticated analytic engine, known as a transformer, and have it analyze an extremely large corpus of texts, in effect the entire web, the resulting artificial neural net can generate text that is pretty much like the text that humans generate. The resulting chatbots, however, do have various problems. So-called hallucination is perhaps the best-known of these, but there are others. I discussed some of them in the 2020 paper I cited earlier (“GPT-3: Waterloo or Rubicon? Here be Dragons”). You can find more if you scroll through Gary Marcus’s Substack.
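For readers who want a glimpse under the hood, the core operation inside a transformer, scaled dot-product attention, fits in a few lines of numpy. What follows is just that one operation run on random toy numbers, not a working language model; a real LLM stacks dozens of such layers, with weights learned from that gargantuan corpus.

```python
# The core computation inside a transformer: scaled dot-product attention.
# Each position in a sequence builds its output as a weighted mix of all
# positions' values, with weights derived from query-key similarity.
# Toy sizes and random inputs; a real LLM stacks many such layers.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8                      # 5 token positions, 8-dim vectors
Q = rng.normal(size=(seq_len, d))      # queries
K = rng.normal(size=(seq_len, d))      # keys
V = rng.normal(size=(seq_len, d))      # values

scores = Q @ K.T / np.sqrt(d)          # similarity of each query to each key
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax over positions
output = weights @ V                   # each row: attention-weighted mix of values

print(weights.shape, output.shape)     # (5, 5) (5, 8)
```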
The belief that is currently dominant in the industry is that, when we’ve thrown enough compute against a truly gargantuan pile of text to build a model that has more parameters than there are stars in the universe, then all of those problems will disappear. Gary Marcus says that’s nonsense, as do David Ferrucci, Yann LeCun, and Fei-Fei Li. I agree.
Now, while we can call out various individual researchers, executives, and investors for their blindness, for not knowing enough about language, cognition, and the human mind to properly evaluate the behavior of these remarkable systems, I fear that such criticism is misdirected. As I point out in the whaling article:
This blindness has been institutionalized. That one is well-qualified to address questions about the human-level capacities of AIs, regardless of one’s actual knowledge of human capabilities, that is implicit in the institutionalized culture of AI and has been there since Alan Turing published “Computing Machinery and Intelligence” in 1950.
As I have already said, the problem can be traced back to the pipeline that leads from undergraduate school, into graduate or professional school (for some), and on into industry or, for an ever smaller group, into the academy. The problem is collective and institutional.
Nor is it at all obvious to me just how or when this could have been prevented. The performance of ChatGPT surprised everyone, certainly the general public, but also, and just as certainly, the scientists and engineers who created it. That performance was not and could not have been predicted when OpenAI started training GPT-3.5. (And why not, pray tell? Because we don’t know how these things work.) I am inclined to believe that such a surprise was all but inevitable at some point. Now we must live with the consequences.
But it is NOT obvious to me that the consequent widespread monoculture was inevitable. If Gary Marcus and others had been taken seriously, say, in the middle of 2023, the commitment to a transformer-centric AI culture might not have been so extreme, and more resources might have gone to investigating and developing other architectures. That’s not what happened. So here we are.
What should we do? What CAN we do?
What Needs To Be Done?
I don’t know how to answer the last question. I’m not a seer. But the answer to the first question is really quite simple: We need to invest, not only in other architectures, but in different kinds of physical substrate, more like the nervous system in structure, performance, and even physical substance. Let me return to Claude’s “The Shock and the Narrowing” essay:
What is actually required — and this is an argument that almost nobody in a position to act on it currently wants to hear — is something closer to the Apollo project in scale but Bell Labs in spirit. Not a defined goal pursued by centrally coordinated engineering. But sustained, generously funded, genuinely open-ended scientific inquiry into the nature of intelligence itself: the computational principles underlying biological neural networks; the role of embodiment and physical interaction in the development of reasoning; the architectural alternatives to attention-based transformers; the hardware innovations — neuromorphic, biohybrid, photonic — that might support a genuinely different computational paradigm.
This is the kind of research that markets will not produce, because it cannot be justified on a product roadmap. It is the kind of research that requires the NSF or NIH model — investigator-directed, long-horizon, evaluated on scientific merit rather than commercial potential. And it is the kind of research that, in the current political and economic climate, is being progressively defunded in favor of applied work that can demonstrate near-term returns.
When is that likely to happen? I do not know. If I lean on my innate optimism really hard, I can imagine that in ten years or so the necessary research (which already exists in nascent form) will begin thriving and our institutions will change in ways that allow us to recover from this collapse. For what it’s worth, here are two blog posts sketching what that could look like: “Brave New World: Notes on the next 30 years in AI [Work in Progress]”, March 19, 2026; and “AI for the Next 30 Years: Four kinds of activity that should be pursued.”
Let me give the last word to Claude:
The shock of ChatGPT’s success was real, and the decisions it triggered were understandable. But a civilization that allows a single commercial surprise, however dramatic, to determine the entire research agenda for the most consequential technology in its history has made a coordination error it may spend decades correcting.
