by Ali Minai
But nature is a stranger yet:
The ones that cite her most
Have never passed her haunted house,
Nor simplified her ghost.
—Emily Dickinson, “What Mystery Pervades a Well”
The first article of this 2-part series laid out the idea of emergence in complex systems, discussed how the appearance of abilities such as the generation of grammatical, syntactically correct, and meaningful text can reasonably be seen as an example of emergence, but also why these emergent abilities are just a shadow – or ghost – of the deeper language generation process in humans. This second part gets more deeply into the last point, making a detailed argument for why the linguistic abilities of LLMs should be seen as limited, and what would be needed to extend them.
The Meaning of Meaning
Perhaps the most critical difference between an LLM’s model of language and that of a human lies in how each understands meaning. The task on which LLMs are trained provides no explicit information about meanings, and depends only on knowing the structural relationships between words in text. However, the fact that LLMs almost always use words in meaningful ways indicates that they have an implicit model of meanings too. What is the nature of that model? The answer lies almost certainly in a linguistic idea called the distributional hypothesis of meaning, which says that the meaning of a word can be inferred from the statistics of its use in the context of other words. As described in the first article, LLMs based on transformers are predisposed to the statistical learning of structural relationships in text, and their representations of meaning must be derived implicitly from this because of the tight linkage between word usage and meaning per the distributional hypothesis. Given enough data, the statistics can become very accurate – hence the meaningful output of GPT-4 et al. But such meanings – though accurate for purposes of usage – are abstractions. In contrast, the meanings in the human mind are grounded in experience. GPT-4 might understand the meaning of the word “burn” as something associated with the words “fire”, “heat”, “flame”, “smoke”, “pain”, etc., but a person understands it in terms of the sensation of feeling heat and getting burned. Similarly, GPT-4 may use the words “near” and “far” correctly, but it has no experience of an object being near enough to touch or too far away to recognize. Concrete meanings in the human mind are thus grounded in the fact of the human animal being embodied – a physical system with sensations and the capability of action that leads to physical consequences. The LLM is “all talk” – just simulating concrete meanings as ungrounded symbols.
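To make the distributional hypothesis concrete, here is a minimal sketch in Python. It is not how an LLM is actually trained; it only counts which words occur near which other words in a tiny corpus invented for this illustration, and then compares words by the similarity of those counts. The corpus, the window size, and the function names are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus, invented for illustration only (not from any real dataset).
CORPUS = [
    "the fire gave off heat and smoke",
    "the flame gave off heat and smoke",
    "the fire can burn your hand",
    "the flame can burn your hand",
    "the house is near the river",
    "the town is far from the river",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


VECTORS = cooccurrence_vectors(CORPUS)

if __name__ == "__main__":
    print("fire ~ flame:", cosine(VECTORS["fire"], VECTORS["flame"]))
    print("fire ~ near: ", cosine(VECTORS["fire"], VECTORS["near"]))
```

In this toy corpus, “fire” and “flame” appear in identical contexts, so their vectors are essentially the same, while “fire” and “near” share only the word “the”. The sketch ends up treating “fire” and “flame” as close in meaning without any notion of heat or pain, which is exactly the sense in which such meanings are abstractions rather than grounded experience.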
Of course, not all concepts in human language are concrete enough to be defined in direct experiential terms, and there is ongoing debate about how the mind grounds abstract meanings. However, it seems plausible that they are grounded in the substrate of more concrete meanings, and thus indirectly in experience as well (see Buccino and Colagé (2022) for a recent overview, and Minai (2023) for my views in detail). It is in this sense above all that LLMs are ghosts in the machine.
Among the most important meanings that disembodied LLMs don’t understand are emotions and motivations – the shapers and drivers of all human experience. Reinforcement learning, i.e., teaching a system by rewarding and punishing trial-and-error behavior, can be used to give AI systems a crude notion of pleasure and pain, but the complexity of human – and all animal – emotions is far greater. Again, embodiment plays a key role. Fear arises from the possibility of feeling bodily pain but also of psychological loss. Love, anger, hate, and envy are all meaningful in the context of a body that seeks comfort and wishes to avoid discomfort; that realizes its limitations and the certainty of its transience; that has experienced domination and dependence; that craves company and attachment. To an LLM, these are all just words to be used correctly. Its blood pressure cannot rise, its heart cannot beat faster, it cannot feel the touch of a hand, a caress or a slap. It cannot stumble or climb, chase or flee, grip or release. Its understanding of all this is worse than second-hand. It’s the same with motivations, which are in a reciprocal relationship with emotions, but are also generated autonomously by the needs of the system: hunger, thirst, nesting, mating, etc. All of these things require agency, i.e., the self-generated desire to achieve specific goals, and, in themselves, LLMs have no agency. They sit passively until queried, respond, and then wait passively again. They can, however, serve as the core of a larger system with agency. For example, it may be argued reasonably that an LLM can become fully “embodied” in cyberspace if it is allowed to connect to the Internet, read news feeds, social media, etc., in real time, and to perform actions such as clicking links and buttons, filling forms, sending messages, etc. In fact, such LLM-based agents are already being built (and, incidentally, are likely to be a critical factor in the risks posed by AI).
For something like the equivalents of animal emotions to develop in such an agent, it would need to have affective states such as satisfaction, expectation, anxiety, etc. – either pre-defined or learned – and signals to arouse or suppress them (such as the functions performed by dopamine, serotonin, and norepinephrine in animals). It would need to have a working memory at multiple time scales (so it can keep track of what it is doing), and some sort of self-motivation that drives it to explore and seek satisfaction. In the end, however, the embodiment of a cyberspace agent would be so different from that of a human agent that the two are likely to have very different grounding of meanings even if they share the same vocabulary (e.g., the cyber-agent’s notions of “far” and “near” will not refer to distance in physical space). The cyber-agent would, at its core, be a real but alien intelligence.
The astonishing success of LLMs has led many to speculate that the elusive goal of achieving artificial general intelligence (AGI) may be at hand, but, in my opinion, this rests on an incomplete view of what constitutes intelligence – at least in the natural sense that animals have it. In his classic book, “Thinking, Fast and Slow”, Daniel Kahneman proposed that mental functions can usefully be divided into two levels: System 1, a fast-acting system that performs its tasks automatically and subconsciously; and System 2, engaged in more deliberative performance of complex tasks. The vast majority of human – and presumably animal – thinking belongs in System 1. But confining this view to thinking alone is inappropriate and inadequate, and is just a reflection of a crypto-dualist brain-body dichotomy implicit in disembodied views of the mind. The body includes the brain, and is not its marionette. The body’s only function is to generate behavior in the form of specific activation patterns of its components. When the activated components are just neurons in the brain, we call it thought, feeling, memory, etc.; when the activated components include muscle fibers and tendons, we term it action. Everything else the body does – delivering nutrients to tissues, processing and propagating signals, secreting hormones, digesting food, etc. – is behavior in the same sense, and most of it also involves the autonomic nervous system. From the viewpoint of AI, it is important to look at thought and action as an integrated whole, just as is the case in reality. The System 1-System 2 division applies to action just as it does to thought: Both can be automatic and subconscious or deliberative and deliberate. As Kahneman and many others have noted – and as our own experience confirms – most of what the brain-body system does is System 1 stuff. As we walk or drive, we do not deliberate before moving each muscle, for if we did, no action could ever be accomplished in time.
Instead, we rely on the structure of our bodies and the circuits of the nervous system – configured over millions of years of evolution, years of development from infant to adult, and vast amounts of learning in the brain – to generate useful action smoothly without thinking. This deeply and inherently intelligent “machine” takes in all its sensory input across all modalities, integrates it with its own state, and generates new states of thought, feeling, emotion, memory, and action continuously in real time, just as a rotation is generated in a pinwheel by a breeze. Only a minuscule fraction of these states rise to the level of consciousness; even fewer are the result of deliberation (which, from a non-dualist viewpoint, must itself be seen as just a more complex, slower-changing trajectory in the state space). The key point here is that even System 2 behavior – thought or action – is built on a deep substrate of System 1 behavior: The key we learn to press when first learning to play a piece of music may be chosen deliberately, but the coordination of intention and movement that allows us to press it at all is entirely automatic. System 1 is the soil in which System 2 grows.
The separation between mental and physical behavior in the field of AI, and an inordinate focus on System 2, have created a rather distorted view of intelligence. LLMs have reinforced this distortion, giving rise to the idea that AGI can be achieved purely through text-based and image-based learning about the world. It’s an easy view to accept because language and imagery are the primary mediators of virtually all our System 2 functions, but it is also deeply mistaken if the goal of AI is to build intelligence similar to that of humans. Text and images capture a lot of the world, but in very imperfect ways and mediated by the constraints of language for text. In some ways, those constraints are good because they are providing the AI system “pre-symbolified” data, but a lot is also lost – in particular, the ability to represent the deep, multiscale causal relationships inherent in physical experience.
In 1988, roboticist Hans Moravec noted that complex tasks such as logical reasoning and playing games like chess are easier for machines to learn than simple sensorimotor tasks such as navigating cluttered spaces. This has come to be known as Moravec’s Paradox. Things have since moved much further in all areas, including robotics, but as a recent piece in The Atlantic observed, AI for mental tasks has greatly outpaced progress in robotics. The article suggests – correctly – that a major factor behind this disparity is that the purely mental System 2 functions that current AI is focused on are inherently more amenable to learning from vast amounts of text, image, video, and numerical data, whereas learning actions requires embodiment and direct experience of the real world, which is far more complex, messy, and dangerous than the safely abstract world implicit in data. This is why producing extremely smart chatbots or Go champions is inherently more feasible than putting safe, fully self-driving cars on the road. It is also why AI programmers, lawyers, and physicians will likely become a reality sooner than useful household robots. You can learn all of medicine from text and data, but you can’t learn to fold laundry – actually fold it, not just the steps – without doing it.
Embodying the Ghost
A natural follow-up to all this is the question of why one cannot simply embed an LLM into a physical agent such as a robot. To be sure, one could connect an LLM with a robot that can translate its output into action (not a trivial task), but that will leave the robot “buried in thought” (to quote Edwin Guthrie’s critique of Edward Tolman’s thoughtful rat) – dependent on deliberative thinking for every micro-action. Perhaps computation will eventually get fast enough and mechanics agile enough for a purely deliberative robot to function in the real world in real time, but that seems both unlikely and inefficient. Animals are agile because evolution has configured their bodies to have useful coordination modes, development has instantiated this configuration and provided the synergies throughout the body that allow complex behavior to arise emergently in real time as the animal interacts with its complex environment, and neural learning has enabled the animal to exploit this repertoire of synergies. That is how automatic action arises.
A crucial difference between today’s artificial living systems and real ones is the nature of their embodiment. At least for now, the robots we build are made from materials such as metal, plastics, composites, etc., rather than organic materials, and their brains are simple neural networks or other types of processors. They typically have fewer degrees of freedom (i.e., ways in which various parts of their body can move) than humans and other mammals, but more importantly, they have far less organizational depth. The body of an animal such as a human is a deep, multi-scale complex adaptive system. The brain is certainly the most complex part of this body, but every other part – all organs and tissues, skin, bone, blood, muscle, etc. – is also a highly distributed, adaptive, perpetually self-organizing system with hierarchical depth. These parts can operate at the level of molecules such as proteins and DNA, sub-cellular structures (receptors, channels, mitochondria, etc.), cells of many different types, specific assemblies of cells (e.g., pancreatic islets and central pattern generators in the nervous system), and larger structures at the organ and multi-organ levels. This makes the body extremely adaptive, self-monitoring, self-protecting, self-healing, capable of numerous modes of sensing (the five basic senses plus balance, proprioception, pain, etc.), capable of locomotion in multiple ways, capable of performing a vast array of actions, etc. In contrast, even the most agile robots (such as Spot built by Boston Dynamics) are extremely clumsy and limited. If meaning is indeed grounded in embodiment, the complexity of that embodiment will have a direct relationship with the complexity of grounded meanings. For example, a metal robot without a fully innervated body would not have body-awareness in the same way as an animal, and would thus have a different conceptual map of injury and pain.
Interestingly, a team of programmers has recently added ChatGPT to the Boston Dynamics Spot robot, turning it into a talking dog, but so far, it seems to be a superficial integration. Deeper integrations are likely in the near future, but it remains to be seen whether an LLM trained on human language will make sense to a robot with a radically different embodiment, or send it into a paroxysm of self-alienation.
There is already a very well-developed science of having robots learn from experience. In this process the robots develop internal representations just as GPT-4 does, but usually not at that scale because nothing similar to the vast amounts of text data used in LLMs is available for robots. Of course, experience in the world can quickly provide such data, but only if the robot has the right sensors, complex possibilities of action, and a sufficiently complex “brain” – say, something like an LLM – with internal motivations and affective states. How could such a robot be trained? Certainly not by letting it explore its options in the real world! However, it can do so in simulated, virtual reality environments – a method that is already used widely to validate models in robotics. But there is a problem that has slowed down the development of such systems: The self-supervised method made possible by text for LLMs cannot work with the data of real-time experience because the correct output is generally not available. In robotics, the approach has been to use reinforcement learning, but the current methods are very slow and scale poorly. In contrast, animals learn very quickly through reinforcement. The difference lies in the multiscale organization of coordination modes discussed earlier. Evolution and development preconfigure a repertoire of useful coordination modes as primitives of behavior, and reinforcement learning simply needs to learn how to trigger the right combinations. The instantiation of these coordination modes through a gradual process of development ensures that they are learned efficiently by each stage building only on the successful modes learned in earlier stages, e.g., toddling on standing, walking on toddling, running on walking, etc. The standard approach in machine learning, in contrast, is to begin with a naïve system, and then train it to perform a very complex function (as with LLMs). 
Small wonder, then, that it takes so much data and so much training. There are systematic and ongoing efforts to use evolutionary and developmental methods in robotics, but not at the scale we see with LLMs. Perhaps a short-cut would be to embed an LLM into a naïve robot, and let the robot use it as a starting point for further learning. There are also efforts to develop systematic methods that mediate communication between AI systems and humans, translating the language of each into that of the other.
Animal Brains vs AI Brains
The discussion above has covered the differences between embodied animals and disembodied AI systems such as LLMs. However, it is also instructive to focus just on the brain, which an LLM is partly attempting to simulate. The biggest difference between the two is obviously in size – the brain has orders of magnitude more neurons and far more neuron types. But the size disparity will soon be overcome. A much more profound difference is in the architectures and computational processes of the two systems. As suggested in part I of this series, LLMs are layered neural networks with hundreds of layers. These layers are of just a few types, and arranged in a very simple feed-forward architecture, giving the system a generic and rather homogeneous organization, i.e., just a simple repeating pattern with small deviations. In contrast, the brain is a very deep, multi-scale, extremely diverse system. Most importantly, it is organized into a specific architecture that is optimized to serve specific functions.
One key feature of the brain is the presence of two distinct sub-systems: A relatively newly evolved component – the cerebral cortex – which has a more generic, but hierarchically organized structure, and a very heterogeneous older part (in an evolutionary sense) that may be termed the sub-cortical component. The advantages of both types of architectures are well understood from engineering: Heterogeneous systems with very different modules connected in a specific, mostly fixed architecture (e.g., custom-designed application-specific computer chips) are very efficient at performing specific functions, while systems with a generic and flexible architecture (e.g., gate-array chips) are useful for performing a broad range of functions using generic procedures. Though the systems work together in most mental functions, the cortex plays a central role in enabling higher cognitive functions such as reasoning, planning, language, etc., i.e., System 2 functions. Sensorimotor functions involve both components (and some others), and the more primitive functions such as affect, emotion, motivation, etc., are mediated by the older sub-cortical system, hence the outdated trope of the “lizard brain” for a part of it (the limbic system). The sub-cortical component is critical to the dominant System 1 functions of the brain, and the quest to build truly intelligent, autonomous, self-motivated, human-like robots will need to develop some equivalent of this component and integrate it with the models of higher cognition that LLM-like systems represent. Whether humanity should try to build such robots is something to think very hard about right now.
The differences between natural and AI brains, of course, have many other important dimensions. For example, real neurons are far more complex, and can perform simultaneous computations at many different levels. In spite of more than a century of massive, concerted research, much of how the brain performs its functions is not well understood. Should we wait for such understanding to emerge and apply its results to building artificial intelligence that is closer to the natural kind? Given how far AI still is from achieving natural intelligence, this might seem like the best way (as I explain in detail in Minai (2023)). However, humans are impatient and technology hates to take the scenic route. It seems much likelier that, after the emergence of powerful models such as GPT-4 for language and Stable Diffusion for image generation, AI will take a non-biological route in pursuit of general intelligence. What alien mind may emerge from that pursuit is anyone’s guess.
- G. Buccino and I. Colagé (2022) “Grounding abstract concepts and beliefs into experience: The embodied perspective”, Frontiers in Psychology, vol. 13. https://doi.org/10.3389/fpsyg.2022.943765
- A. A. Minai (2023) Deep Intelligence: What AI Should Learn from Nature’s Imagination. Cognitive Computation. https://doi.org/10.1007/s12559-023-10124-9