“Stochastic parrot” is a misleading metaphor for LLMs

Metaphors matter enormously, both to how we think about things and to how we structure debate, as a long research tradition within cognitive science attests [1]. As tools, metaphors can help us think better about an issue, but they can also lead us astray, depending on which relevant characteristics they make salient and which they obscure. The notion that large language models (LLMs) are, in effect, “stochastic parrots” currently plays a central role in the debate on LLMs. What follows are my thoughts on the ways in which the metaphor is (now) creating confusion and hindering progress.

This means that what follows is as much a comment on that debate as it is on the metaphor itself. It is worth stressing that I take the very function of debate to be helping us examine our ideas and thoughts in order to improve them. Identifying potential weaknesses or errors in what is being said is consequently not meant to disparage; it is an integral part of debate doing exactly what it is meant to do, on all sides.

First off then, what is the stochastic parrot metaphor? According to Bender and colleagues (2021),

Text generated by an LM is not grounded in communicative intent, any model of the world, or any model of the reader’s state of mind. It can’t have been, because the training data never included sharing thoughts with a listener, nor does the machine have the ability to do that.

In short, LLMs lack, either partly or wholly, “situationally embedded meaning”. In line with this, I take the phrase “stochastic parrot” to make salient three main things: like the ‘speech’ of a parrot, the output of LLMs 1) involves repetition without understanding, 2) albeit with some probabilistic, generative component, and 3) is, in that, very much unlike what humans do or produce.

To the extent that the phrase draws people’s attention to the fact that LLMs are not ‘people’ and that something about their workings might involve transition probabilities (say, ‘next token prediction’), the metaphor seems both useful and undoubtedly effective.

Beyond that, however, I now see it giving rise to the following problems in the wider debate:

  1. confusion between what’s ‘in the head’ and ‘in the world’

  2. a false sense of confidence regarding how LLMs and human cognition ‘work’

  3. an illusion of explanatory depth

  4. a misdirection of evaluative effort

  5. a misdirection of discussion about risks and harms

I will give examples of each in turn.

Confusion between what’s ‘in the head’ and ‘in the world’

The academic literature distinguishes multiple aspects of ‘meaning’ in human language and communication (see Appendix below). Debate continues on their nature, their relative importance, their interrelationship, and the explanatory adequacy of the distinctions scholars have drawn.

However, one fundamental feature of human language, simply as an observational fact, is that it is a conventional system. Words (and sentences) ‘have meaning’ because the speakers of a language use them in certain ways (not necessarily all in exactly the same way, just with sufficient overlap for there to be a convention, see e.g., [2]). In the words of the philosopher Hilary Putnam: in some respects at least, “meaning ain’t in the head” [3].

Consequently, when Polly the pet parrot says “Polly has a biscuit” that (in some sense) ‘means’ something, and it can be true or false, regardless of whether Polly *herself* has any idea whatsoever what those sounds she produces ‘mean’, let alone a concept of ‘language’, ‘meaning’, or ‘truth’.

This follows simply from the fact that this aspect of meaning doesn’t rest on any single head, artificial or otherwise, but rather on the practice of a community. And whether “Polly has a biscuit” is true depends not on Polly’s grasp of human language, but on whether she actually has a biscuit.

This makes it wrong, in my view, to claim that the

“tendency of human interlocutors to impute meaning where there is none can mislead both NLP researchers and the general public into taking synthetic text as meaningful”

Bender, Gebru, McMillan-Major, & Shmitchell (2021)

Specifically, the claim seemingly equates the decoding of meaning with the decoding of *intention*, overlooking the component that rests on the decoding of *convention*. The conventional aspects of meaning can make both Polly’s noises and synthetic text ‘meaningful’, and even true.

This is not just an arcane and esoteric point about natural language meaning, or about LLMs. It is central to the very concept (and value) of “computational system”. My pocket calculator doesn’t ‘have a grasp of meaning’. It doesn’t ‘understand’ that 2 + 2 = 4. But that doesn’t stop it being useful. That utility ultimately rests on there being a semantic mapping somewhere; the calculator would be of no use if 2 and 4 didn’t ‘mean’ something, at least to me. But that doesn’t require that mapping to be internal to the calculator or in any way accessible to it. It simply isn’t something the calculator itself has to ‘know’.
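To make that concrete, here is a minimal sketch (a purely illustrative toy of my own invention, not a model of how any real calculator works): a ‘calculator’ that shuffles uninterpreted symbols according to fixed rules. It is useful only because *we* read ‘2’ and ‘4’ as numbers; nothing in the code has, or needs, access to that mapping.

```python
# A toy 'calculator' (hypothetical example): it manipulates symbols it does not
# interpret. The successor table could use any marks at all; the arithmetic
# 'meaning' of the output lives with the user, not inside the system.
SUCCESSOR = {"0": "1", "1": "2", "2": "3", "3": "4", "4": "5", "5": "6"}

def add(a: str, b: str) -> str:
    # Step the first symbol forward once for every step needed to count up to b.
    result, counter = a, "0"
    while counter != b:
        result = SUCCESSOR[result]
        counter = SUCCESSOR[counter]
    return result

print(add("2", "2"))  # prints "4" -- useful to me; 'meaningless' to the table
```

The same holds, of course, for a real calculator implemented in hardware: the semantic mapping sits with its users.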

A false sense of confidence regarding how LLMs and human cognition ‘work’

By making salient the contrast between parrot (mindless repetition) and human (genuine communication), the metaphor suggests that we know that what LLMs do is very much not what humans do.

This obscures several basic facts. It obscures that LLMs are actually based on a human-inspired computing paradigm: LLMs descend from a paradigm loosely modelled on highly stylised neurons, and that paradigm was itself born out of attempts to understand human cognition as a form of computation.

It also obscures that cognitive science, the discipline that tries to understand human thought as ‘computation’ or information processing, exists precisely because we *do not* (fully) understand how human language or thought actually work. The multiple questions associated with understanding ‘meaning’ (see Appendix below) extend to other seemingly fundamental notions such as ‘understanding’, ‘reasoning’, or what it means ‘to know’.

And, finally, it obscures that we very much don’t know how LLMs ‘work’ either. “Next token prediction” is part of the ‘stochastic’ part of the parrot. From many online conversations, my impression is that there is a tendency to overlook the distinction between the *task* (“predict the next token”) and whatever generative model an LLM forms through training that allows it to fulfil that task, and that the mapping from the former to the latter is taken to be more straightforward, and more restrictive, than it is.

An LLM’s behaviour will rest on structure implicit in its training data and on the representations of that structure the system forms during training. Taking the parrot metaphor seriously, by contrast, suggests a characterisation of LLM performance as “repetition with a bit of noise”. That, after all, is what we take real parrots to do.

This may reinforce a mistaken view of ‘training’, or learning, as consisting largely of the ‘storage’ of previously encountered items. This, in turn, I suspect, fuels claims that the performance of a model like GPT-4 on a benchmark of human performance, such as passing the bar exam, is unremarkable and simply reflects regurgitation of material already available in the input (even where this is demonstrably not possible, as in the essay part of the bar exam, where the questions post-date the end of model training [4]).

That perception is, of course, also at odds with the extent to which models such as GPT-4 confabulate answers. Neither ‘just repetition’ nor ‘next token prediction’ suffices to explain the production of a made-up reference to a fictitious author, with a fictitious title, complete with fake DOI, as this reference was never *in the input*, nor (for the same reason) ever actually “the most likely next token” in any straightforward way.

Yet the parrot-like lack of understanding and the emphasis on ‘next token prediction’ are routinely invoked in current online discourse to explain both these successes and failures, more or less at the same time.
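To see how little “repetition with a bit of noise” captures even the simplest next-token setup, here is a deliberately trivial sketch (a toy bigram sampler over a made-up corpus, nothing remotely like a real LLM): performing the task “predict the next word” requires some model of which words follow which, and sampling from even this crude model can produce word sequences that never occur verbatim in the training data.

```python
# A toy next-token sampler (illustrative only; real LLMs are not bigram models).
import random
from collections import defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# 'Training': record which words follow which in the corpus.
follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

# 'Generation': repeated next-token prediction, starting from "the".
word, output = "the", ["the"]
while word != "." and len(output) < 10:
    word = random.choice(follows[word])
    output.append(word)

print(" ".join(output))
# Can produce e.g. "the dog sat on the mat ." -- a sentence that appears nowhere
# in the corpus, even though generation is nothing but next-token prediction.
```

Whatever a trained LLM has learned is vastly richer than this, but the underlying point stands: the task specification alone does not tell you what the model that performs it looks like, or what it can generate.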

The illusion of explanatory depth

This itself might be seen as a manifestation of a more general failing, namely that the ‘stochastic parrots’ metaphor and, with it, the appeal to a lack of “situationally embedded meaning” give rise to an illusion of explanatory depth.

In fact, not a single computational system we have built in the past has, arguably, had “access to situationally embedded meaning” in the sense Bender et al. describe above. This ranges from any simple script or computer programme I have ever written and run (functional or not), through basic computational devices such as a pocket calculator and a wide range of now essential systems such as electronic databases, to computational systems that, by whatever design approach, manage to far exceed aspects of human performance, whether these be weather-forecasting programmes, IBM’s Deep Blue, or AlphaGo. None of them has “access to situationally embedded meaning”.

That means “situationally embedded meaning” is a zero-variance predictor vis-à-vis the entire gamut of capability and utility computational systems have exhibited. By virtue of that, it can neither explain nor predict anything about the performance range of those systems.
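The statistical point can be made with toy numbers (entirely made up for illustration): a feature that takes the same value for every system in the comparison cannot account for any of the differences between them; its correlation with capability is not even defined.

```python
# Illustration with invented scores: 'capability' varies wildly across systems,
# while 'access to situationally embedded meaning' is 0 for every one of them.
import numpy as np

capability = np.array([1.0, 3.0, 7.0, 42.0])    # made-up performance scores
embedded_meaning = np.zeros_like(capability)    # the same value (0) throughout

# A zero-variance predictor yields an undefined correlation (nan):
print(np.corrcoef(embedded_meaning, capability)[0, 1])
```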

In light of that, it seems hard to sustain the notion that it explains anything specifically about LLMs themselves either.

A misdirection of evaluative effort

This arguably has a knock-on effect. “Lack of situationally embedded meaning” has been widely picked up not just as an explanation of the limitations of LLMs; it is presented as an in-principle restriction on what they can do.

This suggests, falsely in my view, that we know something about the possible behaviour of such systems without having to look in any detail at their actual behaviour and performance.

It is, of course, the case that there are systems whose behaviour we can confidently gauge solely by virtue of some fundamental characteristic: for any object, if it weighs more than the water it displaces, I can appeal to that simple fact and don’t need to examine the object in detail to know whether it floats.

However, for that to work, we at least need some kind of causal connection between the feature and the relevant behaviour. That is a hard case to make for access to meaning and LLMs, given the lack of covariation just outlined.

To understand what LLMs can do, what they can do well, and where they fail, we have to look at and evaluate the behaviour of actual systems. Those capabilities are an empirical question. They vary across LLMs to date, and those varying capabilities in turn determine what useful functions such systems could perform. None of that work can be short-circuited by an in-principle consideration, let alone one about ‘meaning’ and ‘understanding’.

A misdirection of discussion about risks and harms

If one takes the lack of ‘situationally embedded meaning’ to fundamentally restrict what a computational system can do, then it might also make sense to take that fact to limit what harms such a system could do now or in future.

It should, by now, be clear that ‘lack of situationally embedded meaning’ patently does not (in my view) sufficiently restrict function for that argument to go through.

Because ‘lack of situationally embedded meaning’ explains none of the variance across extant computational systems, it also doesn’t strike me as a meaningful predictor of future performance. Hence it risks obscuring what data on system performance we actually have with which to address this question.

For example, there is an inductive argument to be made that there is additional cause for concern, beyond present risks and harms, based on the empirically observed, rapid improvement of performance as a function of increases in scale in language models to date [5]. The fact that there are discontinuities in the emergence of capabilities as that scale has increased does not undermine the point; rather, it emphasises the uncertainty (high variance) in predicting individual capabilities. This further underscores the limits of an appeal to properties of ‘meaning’ as a gauge of future performance.

If, by contrast, one takes the notion that LLMs are ‘stochastic parrots’ to cap whatever capabilities such systems could ever develop at roughly current levels, then it does make sense to worry mostly about current risks. It then might also make sense to consider one of the main problems with LLMs to be that people dangerously over-estimate LLM capabilities.

It is right to point out that user expectations and understanding of a system are important safety considerations. In keeping with that, anthropomorphising a system might indeed be a concern. However, it strikes me as debatable whether that concern justifies the effort with which commentators have sought to clamp down on expressions such as LLMs “hallucinating”, on the grounds that those expressions might foster anthropomorphisation.

It is true that humans have a tendency to anthropomorphise technologies: it’s as if the printer knows I’m in a rush. Yet that doesn’t seem to cause widespread, significant problems in our interactions with devices, in the sense of our expecting them to do things wildly beyond their actual capabilities: I’ve never asked my printer to come along for a drink. I, like others, have managed to form a decent understanding of what printers can and cannot do, and that rests, in good part, on the reliability of their observed behaviour: that is, on the things my printer does and doesn’t do. Whatever problems I do have with my printer ultimately stem from its failure to function sufficiently reliably. The printer not printing is the problem, not my (limited) anthropomorphisation.

Maybe LLMs will be different for people, and maybe using words such as “hallucination” will make that worse, as opposed to providing a powerful signal that LLM output is sometimes wildly unreliable. But that, too, is an empirical question.

Conclusion

Whether one cares more about current problems or more about potential future ones, about both or about neither, is a value judgment. The extent to which absence of situationally embedded meaning restricts future performance, and hence risk, by contrast, is a causal, empirical claim.

It is an empirical issue what LLMs can do, and it is an empirical issue how they (or human beings) actually work, and what role situationally embedded meaning might play in that. The ‘stochastic parrots’ metaphor conveys something about an otherwise complex and opaque bit of technology, and to that extent, it has been helpful.

But my impression is that it is now a red herring that misleads and distracts. It unintentionally blocks and derails conversation by pointing our thoughts in the wrong direction if we care about how these systems work and what they can do. Even worse, I think it now also functions to block conversation intentionally, through increasingly exasperated restatements (i.e., “they are just stochastic parrots, why don’t you get that?”).

I think our discourse around LLMs would improve if we shifted our focus. So I would suggest that we put the metaphor to rest, at least for a bit.

References

[1] Lakoff, G., & Johnson, M. (2008). Metaphors we live by. University of Chicago Press.

[2] Labov, W. (1973). The boundaries of words and their meanings. In New ways of analyzing variation in English.

[3] Putnam, H. (1975). The meaning of ‘meaning’. In Mind, Language and Reality: Philosophical Papers, Vol. 2 (pp. 215-271). Cambridge University Press.

[4] Katz, D. M., Bommarito, M. J., Gao, S., & Arredondo, P. (2023). GPT-4 passes the bar exam. Available at SSRN: https://ssrn.com/abstract=4389233

[5] Bowman, S. (2023). Eight things to know about language models. https://cims.nyu.edu/~sbowman/eightthings.pdf

Appendix: Aspects of Meaning

‘Meaning’ is studied across multiple disciplines (philosophy, psychology, linguistics, and more) with divergent perspectives and inconsistent use of terms. The above are aspects of meaning that I take most people to be happy to distinguish in principle, or at least to regard as meriting consideration. There is, however, wide-ranging disagreement about what any of these things consist of exactly, how separate they are (is there really a boundary between ‘conceptual’ and ‘encyclopaedic’ knowledge, or a boundary between semantics and pragmatics?), to what extent they are just idealisations, how important they are to language and communication, which are most basic, and so on.