Ulrike Hahn

@bbk.ac.uk

One way in which discussions of AI capabilities are unsatisfying is that they often descend into what feels like argument about words or ‘semantics’. At the same time, discussions often seem at cross-purposes. This post suggests a reason why this might be particularly common for this topic, and why awareness of that reason might help.

The post has three parts: first, a quick reminder of how words ‘work’ in general; second, an observation about a particular feature of words that describe human behaviour (like walk, read, reason, summarise, etc.); third, and finally, examples of confusion this feature has caused in AI debate.

Part 1: How words work

When talking about words and meaning it’s helpful to distinguish three things: the linguistic label (the English word ‘dog’), the thing in the world the word is used to refer to (actual dogs – the word’s extension or referent), and whatever properties of dogs determine to what creatures we apply the word (the concept or intension).

If we think of words picking out things in the world (referents) as a kind of pointing, then it’s a fundamental feature of words, and the way we use them, that this pointing is vague and approximate (that thing ‘over there’ as opposed to ‘that thing 3.12 meters in front of me’). It just has to be precise enough for you to work out what I’m pointing at, and in many contexts that means not very precise at all. So most of our words don’t have clear definitions and have fuzzy boundaries [1]. Speakers will disagree whether this colour is still blue or already purple, not just with other speakers but with themselves, if we track their answers across multiple occasions (Fig. 1a). And that’s not just colours: it’s also true of everyday objects such as cups (as we know from Labov’s classic study [2] asking people whether different objects were a ‘cup’ or a ‘bowl’, as in Fig. 1b); and it’s true of abstract words like democracy or justice. It applies to most words one can think of.

Fig. 1
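To make the fuzziness concrete, here is a toy simulation in Python (a sketch in the spirit of Labov’s study, not his actual design or data; the boundary and noise values are simply made up). A simulated speaker’s criterion for ‘cup’ versus ‘bowl’ drifts from occasion to occasion, so objects near the boundary are labelled inconsistently even by the same speaker:

```python
# Toy sketch of a fuzzy category boundary (illustrative parameters only):
# a speaker's 'cup'/'bowl' criterion is noisy, so borderline objects get
# inconsistent labels across occasions.
import random

def judge(width_to_depth: float, boundary: float = 1.5, noise: float = 0.3) -> str:
    """Label an object 'cup' or 'bowl' using a noisy criterion."""
    criterion = random.gauss(boundary, noise)
    return "bowl" if width_to_depth > criterion else "cup"

random.seed(1)
for ratio in (1.0, 1.4, 1.5, 1.6, 2.0):
    labels = [judge(ratio) for _ in range(1000)]
    p_cup = labels.count("cup") / len(labels)
    print(f"width/depth = {ratio:.1f}: called 'cup' on {p_cup:.0%} of occasions")
```

Objects far from the boundary get consistent labels; the borderline ones are called a ‘cup’ on roughly half of occasions – the kind of within-speaker inconsistency described above.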

On occasions where we need that pointing to be more precise, we try to ‘clarify’ meanings, refine our concepts, maybe even craft specific definitions. But that’s not something we can usefully do without considering the context and purpose of the finer distinctions we need to draw.

Courts, for example, have to decide what things words apply to: from early 20th century debate on whether electricity was a “movable, physical thing” that could be ‘stolen’ under the law as it then stood, to late 20th century debate on whether “marriage” applies only to a man and a woman. All such decisions are ultimately based on complex considerations about social realities, not on arguments about semantics. And any discussion, legal or otherwise, that assumes that real world issues and controversies can be settled just by arguing about the meaning of words is either unintentionally or intentionally missing the point. Finally, this dependence on purpose and goal isn’t restricted to social issues; it applies equally to seemingly esoteric academic debate.

“Is it really still blue?”, asked without regard to purpose, is a pointless question. What things a label applies to is ultimately a matter of convention, and meaning (and with it word boundaries) can change. More importantly, when the precise location of a boundary really matters, actually drawing that boundary will often leave unresolved much of what motivated us to care: The moment we decide that the thing of concern now falls beyond our newly precise boundary, the question simply shifts to ‘it’s not X, but is it similar enough that we need to treat it the same?’.

So it’s no surprise that questions like “can LLMs think?” or “can LLMs reason?” often deteriorate into mere arguments about meaning. However, there is an additional hurdle, I now suspect. The motivation for thinking about this hurdle comes from seeing people who understand full well everything I’ve said so far seemingly stumble in these debates (myself included). So on to the central (hopefully more interesting) point.

Part 2: Words for things we do

There is something particularly confusing about words that describe human behaviour or human activity. This goes beyond words that are prominent in LLM debate such as “understand”, “think”, “reason”, and “summarise”, and also affects words like “read”, “walk”, “talk”, or “write”.

Specifically, it’s the fact that we can seemingly think about human behaviour along three different dimensions: task, means, and attainment (Fig. 2).* For example, when talking about ‘reading’ we can be concerned with a task – ‘extracting meaning from written text’. Or we could focus on the means by which that task performance is achieved: say, sounding out letters versus whole word recognition. Or, finally, we could focus on attainment – how well somebody reads, say, whether at a first-grade, sixth-grade, or adult level.

My sense is that, in our everyday use of words describing behaviour, these different dimensions are simultaneously present. A typical instance of reading is ‘typical’ along all three dimensions. So moving away from our prototypical cases along any one of these dimensions can raise questions about word boundaries. We might say, for example, “he can’t really read yet, he’s still painstakingly sounding out each word” (attainment too low), “he didn’t really read what I said, he just responded to the tone” (not the same task), or “he’s not really reading, he seems to have simply memorised entire passages” (different means).

So we can think of words as applying to referents that lie within a particular region of three-dimensional space (the boundaries of which are ‘not too far’ from typical cases of everyday use). While the individual dimensions seem at least somewhat separable, they are ultimately connected, including inferentially: if someone’s reading attainment is really low, chances are they are ‘reading’ by different means than an accomplished reader; likewise, their low attainment raises questions about whether they are truly ‘doing the task’. So particular examples might deviate along more than one dimension from the typical thing we call ‘reading’.

Fig. 2
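For concreteness, here is a minimal sketch of the idea that a behaviour word applies within a region of this three-dimensional space. The numeric scales, the prototype values, and the deviation threshold are all invented for illustration; nothing hangs on them:

```python
# Sketch only: an instance of a behaviour described along the three dimensions,
# plus a naive check of which dimensions sit 'too far' from a prototype.
# Scales, prototype, and threshold are illustrative assumptions, not a theory.
from dataclasses import dataclass

@dataclass
class BehaviourInstance:
    task: float        # 0-1: how closely the activity matches the prototypical task
    means: float       # 0-1: how closely the means match the prototypical means
    attainment: float  # 0-1: level of attainment relative to a competent adult

PROTOTYPE = BehaviourInstance(task=1.0, means=1.0, attainment=1.0)

def deviations(x: BehaviourInstance, threshold: float = 0.5) -> list[str]:
    """Return the dimensions along which x lies 'too far' from the prototype."""
    return [dim for dim in ("task", "means", "attainment")
            if getattr(PROTOTYPE, dim) - getattr(x, dim) > threshold]

# "He's not really reading, he seems to have simply memorised entire passages":
memoriser = BehaviourInstance(task=0.9, means=0.2, attainment=0.8)
print(deviations(memoriser))  # ['means'] -- deviant along the means dimension
```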

Finally, it is worth noting that this analysis extends beyond behaviours to nouns (or adjectives) that are ultimately about behaviour (and assessed via behaviour), such as “intelligence” (“intelligent”) or “creativity” (“creative”).

Importantly, when it comes to answering questions such as “can LLMs really think?” or “can LLMs really reason?” we can draw lines not just along a single dimension, but also decide to give more or less weight to (or even ignore) some of the dimensions.

To illustrate: there is a steady stream of comment asserting that LLMs can’t really reason because, for example, while they succeed on inductive tasks, they fail on deductive ones [3], or that they can’t really reason because they fail on a seemingly simple problem [4]. This kind of assessment represents a classification decision based on attainment alone.

There are good reasons to focus on attainment; the Turing test is arguably a pragmatic criterion focussed wholly on ‘attainment’ (an operational definition, not a definition of intelligence per se) that reflects Turing’s belief that other aspects (like ‘means’) were both too difficult to make headway with and of too little practical significance [5].

At the same time, there is a history of seemingly shifting terminology from attainment to means once human levels of attainment are reached (as with Deep Blue or AlphaGo). Ultimately there’s no ‘there’ there, and just drawing a line won’t answer questions about implications such as ‘should I feel threatened by this thing?’ or ‘how useful is it to me?’.

The real significance of the three-dimensional space, though, lies in the fact that it can give rise to communication at cross purposes and to over-estimating the inferential value of findings on one dimension with respect to another. The final section examines concrete examples of this.

Part 3: Illustrative examples

The first example, to my mind, is the discourse about ‘stochastic parrots’. The original paper [6] made a point about ‘means’, to the effect that whatever it is LLMs are doing, it cannot be real understanding because the training data has no access to ‘situationally embedded meaning’. That point, however, subsequently morphed into an argument about what LLMs could conceivably do (i.e., an argument about ‘attainment’). A post I wrote in 2023 tries to set out why I think that fails and I’ll simply refer to it here.

Conducting behavioural experiments with LLMs

The second example involves a typical (now long-running) reaction to psychologists and cognitive scientists subjecting LLMs to established experimental paradigms. This response runs from the merely exasperated (‘what are they even thinking?’) to the exasperated but slightly more constructive, such as the suggestion that, before applying an established experimental paradigm, researchers at the very least ‘need to establish construct validity with respect to LLMs first’. In the meantime, this literature probing LLMs with standard human experiments now involves many studies, across many different phenomena, with increasing methodological sophistication [7].

There’s a confusion I see in this exasperated response; it centres around what I’ve called the ‘task’ dimension and the fact that, in any behavioural experiment, there are necessarily multiple conceptual levels involved. There is the real world phenomenon one ultimately wants to understand (‘reading’, ‘reasoning’, ‘deception’, ‘theory of mind’, ‘linguistic competence’) and the specific experimental task used to probe participant behaviour.

For a reasonably well-understood ‘task’ like ‘reading’ this gives two levels (the familiar, albeit composite, phenomenon ‘reading’ and the experimental task). For other things, such as ‘theory of mind’ or ‘reasoning’, we often don’t even have a well-defined notion of the target real-world phenomenon. So it is given an operational definition (e.g., a set of things you should be able to do if you’ve mastered that thing) and that then motivates the design of specific experimental tasks.

This also means that while you might think that psychologists sort out ‘construct validity’ before they conduct actual experiments (and construct validity is indeed an important concern in some areas, such as psychometric measurement or survey design), there are many areas of psychology where you could go for years reading papers without ever encountering the term ‘construct validity’.

Psychology’s somewhat atheoretical approach has the adverse consequence that research can become overly obsessed with very specific, narrow experimental paradigms (e.g., the hundreds of papers on a particular false-belief task involving characters named Sally and Anne). But the approach also makes sense: it is neither practical, nor possible, to defer, say, the experimental study of human reasoning until that day when we have a complete understanding of what constitutes ‘reasoning’. It’s not possible precisely because behavioural experiments are a significant part of figuring that out. For example, we can learn useful things about the development of a particular capacity such as reasoning in advance of having a fully developed, theoretically adequate, account of adult competence. What we learn won’t just be interesting in its own right, it might also help develop that adult account.

Once we consider LLMs, the difficulties are compounded. Not only is ‘reasoning’ an underspecified concept in humans; we also don’t fully understand how LLMs do what they do. So there is no way we could hope to establish construct validity for an experimental paradigm without actually conducting experimental tests.

So we might as well start with those we already have. Running standard experimental paradigms with LLMs is informative both with respect to human abilities and those of LLMs. Experimental results are informative precisely because there are rich literatures on interpreting results in humans (which may be challenged in interesting ways by findings with LLMs) and because the tasks have already been carefully designed with many possible confounds in mind. It is understandable that, outside of that context, a paper that says something like LLMs “showed evidence of theory of mind” might come across as a contribution to AI’s undeniable hype cycle. But there is a strong chance this is not what the authors thought they were communicating: anyone with the research competence required to conduct the work understands the difference between task performance, operational definition, and ultimate real-world phenomenon. That difference is so obvious that saying something like “shows evidence of theory of mind” basically means “succeeds on this experimental task” or, at best, “succeeds on this experimental task with implications for my operational definition”.
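To make those levels concrete, here is a bare-bones sketch of what ‘running a standard paradigm with an LLM’ amounts to in practice. The vignette wording, the scoring rule, and the query_model placeholder are simplifications of mine, not any published protocol; the placeholder just needs to be swapped for a call to whichever model API is being tested:

```python
# Sketch: administer a classic false-belief vignette to an LLM and score the
# response against an operational criterion (simplified, illustrative wording).

VIGNETTE = (
    "Sally puts her marble in the basket and leaves the room. "
    "While she is away, Anne moves the marble to the box. "
    "When Sally returns, where will she look for her marble? "
    "Answer with one word."
)

def query_model(prompt: str) -> str:
    """Placeholder: swap in a call to whichever LLM API you are using.
    The canned reply below just keeps the sketch runnable end to end."""
    return "Sally will look in the basket."

def score(response: str) -> bool:
    # Operational criterion only: does the model name the original location?
    return "basket" in response.lower()

if __name__ == "__main__":
    responses = [query_model(VIGNETTE) for _ in range(20)]  # repeat to average over sampling noise
    accuracy = sum(score(r) for r in responses) / len(responses)
    print(f"Passes the (operationalised) false-belief probe on {accuracy:.0%} of runs")
```

A result like this speaks, in the first instance, to the experimental task; any claim about the real-world phenomenon runs through the operational definition, just as it would for human participants.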

On the impossibility of achieving Artificial General Intelligence through LLMs

The final example concerns a great recent paper by van Rooij and colleagues [8] that rethinks the role AI should play within cognitive science. One part of their overall argument is a really interesting proof concerning the feasibility of learning from data something that constitutes Artificial General Intelligence (AGI). That proof is interesting for many reasons, but my interest here lies in the support it lends to claims about the impossibility of achieving AGI with LLMs.

Basically, the proof shows that a particular thing can’t be learned with resources that will ever be available in practice. So learning that thing is ‘impossible’ under standard notions of practically feasible computation in computer science.

More specifically, the proof shows the practical impossibility of mastering the following learning problem: you receive pairs of situations and human responses (behaviours) to those situations as inputs (or data), and you must learn to produce accurate human responses to these and to new situations. The really neat thing about the proof is that it shows this to be impossible without appealing in any way to how a learner might do this. The result doesn’t require the specification of an algorithm, so it holds regardless of what means are chosen. It also allows error in the output response: of the possible set of behaviours for each situation, the algorithm simply has to select one such behaviour at levels “non-negligibly better than chance”.
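Here, in my own loose rendering (not the paper’s formal statement), is the shape of that learning problem; the type names are mine:

```python
# Loose sketch of the learning problem's shape: the learner sees
# situation-behaviour pairs and must output a policy that selects an acceptable
# behaviour for new situations non-negligibly better than chance. The proof
# quantifies over every possible `learn`, so no implementation is given here.
from typing import Callable, Iterable, List, Set, Tuple

Situation = str
Behaviour = str
Policy = Callable[[Situation], Behaviour]

def learn(data: Iterable[Tuple[Situation, Behaviour]]) -> Policy:
    """Any learning algorithm whatsoever; the result holds regardless of which."""
    raise NotImplementedError

def good_enough(policy: Policy,
                test_set: List[Tuple[Situation, Set[Behaviour]]],
                chance_rate: float,
                margin: float) -> bool:
    """Success criterion: acceptable behaviours chosen at a rate that exceeds
    chance by a non-negligible margin."""
    hits = sum(policy(situation) in acceptable for situation, acceptable in test_set)
    return hits / len(test_set) > chance_rate + margin
```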

The very power of the proof’s abstraction, however, also limits its import with respect to the learnability of AGI. The latter is a term that has crept into the AI literature more recently, and its meaning is itself under debate. But here is one proposed definition:

“We use AGI to refer to systems that demonstrate broad capabilities of intelligence, including reasoning, planning, and the ability to learn from experience, and with these capabilities at or above human-level” (source)

The thing to note with respect to the three-dimensional space posited here is that the definition makes no mention of ‘means’, and refers only to ‘attainment’ (‘at or above human-level’) and ‘tasks’ (‘reasoning, planning, and the ability to learn from experience’).

By contrast, the proof doesn’t just drop ‘means’, it also drops any mention of ‘task’, focussing only on ‘attainment’. And the only measure of ‘attainment’ used is ‘matching human performance’. But that is not the attainment outcome that AGI is interested in. Rather, our AGI definition concerns performance across a range of (practically) interesting tasks – as illustrated in Fig. 3.

Fig. 3

The AGI definition does not require task attainment to be at or above human levels across the board. AGI won’t, and doesn’t need to, care about all tasks equally; some simply won’t matter. Nor is the goal to approximate human response selection ‘non-negligibly above chance’ for all tasks that do matter. Attainment can deviate from human responding by however much it likes, as long as task attainment is better. Performance at, say, arithmetic can be far better than human and, as a result, produce behaviours that are not possible for humans; the only constraint here is that humans still need to be able to recognise the response as a legitimate response.**
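A toy comparison may help to separate the two attainment targets here; the simulated human error rate is made up purely for illustration:

```python
# Toy illustration: 'matching human responses' versus 'getting the task right'
# as attainment targets. The 10% human slip rate is an invented parameter.
import random

def human_answer(a: int, b: int) -> int:
    correct = a + b
    # Simulated humans occasionally slip on multi-digit addition.
    return correct if random.random() > 0.10 else correct + random.choice([-10, -1, 1, 10])

def system_answer(a: int, b: int) -> int:
    return a + b  # more reliable than the simulated human

random.seed(0)
problems = [(random.randint(100, 999), random.randint(100, 999)) for _ in range(1000)]

mimicry = sum(system_answer(a, b) == human_answer(a, b) for a, b in problems) / len(problems)
task_correct = sum(system_answer(a, b) == a + b for a, b in problems) / len(problems)

print(f"matches the (error-prone) simulated human response: {mimicry:.0%}")  # roughly 90%
print(f"gets the task right: {task_correct:.0%}")                            # 100%
```

The system here is ‘only’ around 90% accurate as a predictor of human responses while being strictly better at the task itself – which is the attainment that matters for the AGI definition above.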

It’s not the goal of AI research to build a human mimicry machine. That’s the wrong attainment target (even though comparison to human performance forms part of evaluating the actually relevant attainment). The proof tells us something of interest (i.e., we can’t brute-force learn our way into a comprehensive human-behaviour prediction machine), but that is not what AI is after. I think the mismatch in thinking about ‘task’ leads the proof to talk past what AGI, from a practical perspective, is actually about.***

Conclusion

That’s it. I think there are multiple implicit dimensions to the meanings of behaviour words. That compounds questions about where to draw boundaries, and it can lead to discussion at cross purposes and confusion. I’ve found thinking things through in this way helpful, so maybe others will find these dimensions useful too.

THANKS

As with my other posts, this one was prompted by particular discussions. I’d like to thank three people in particular – Jonny Saunders, Olivia Guest, and Iris van Rooij – whom I confused (and annoyed) with my own confusion. Thanks for helping me figure out what I was actually trying to say (whether I’m right is another matter).

FOOTNOTES

* Readers with a background in cognitive science might be reminded of Marr’s levels (computational, algorithmic, implementation). The distinction drawn here is different, but it is part of the reason one needs Marr’s levels to describe computational systems in the first place.

** A different way of seeing this is that those ‘tasks’ like reasoning, drafting contracts, etc. on which AGI is trying to match or exceed human performance are nowhere in the LLM learning task. The point of foundation models is that they are trained on one ‘task’ (next-token prediction over a large corpus of text) and are then used to form the core of a whole range of different systems, fine-tuned for different purposes [9]. The foundation model isn’t trained (evaluated in training) on its ability to produce functioning Python code or make jokes. And the extent to which its code or jokes exactly match human situation-behaviour pairs matters only indirectly.

*** Relatedly, I think it also talks past the concerns of some of what the paper calls ‘makeism’ in cognitive science. As discussed in Part 1, one can define ‘cognition’ or ‘intelligence’ to be cognition or intelligence that exactly matches human levels only. But that just means one has defined away, for example, the study of animal cognition (which will differ in many ways). That project, though, will still be interesting and still be informative about human cognition regardless of what it’s called. So I see little purpose to such a definition.

REFERENCES

[1] McCloskey, M. E., & Glucksberg, S. (1978). Natural categories: Well defined or fuzzy sets? Memory & Cognition, 6(4), 462–472.

[2] Labov, W. (1973). The boundaries of words and their meanings. In Bailey, C.-J. N. & Shuy, R. W. (eds), New ways of analyzing variation in English. Washington: Georgetown University Press.

[3] Cheng, K., Yang, J., Jiang, H., Wang, Z., Huang, B., Li, R., ... & Sun, Y. (2024). Inductive or deductive? Rethinking the fundamental reasoning abilities of LLMs. arXiv preprint.

[4] Nezhurina, M., Cipolina-Kun, L., Cherti, M., & Jitsev, J. (2024). Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models. arXiv preprint arXiv:2406.02061.

[5] French, R. M. (2000). The Turing Test: The first 50 years. Trends in Cognitive Sciences, 4(3), 115–122.

[6] Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? 🦜 In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623).

[7] Frank, M. C. (2023). Baby steps in evaluating the capacities of large language models. Nature Reviews Psychology, 2(8), 451–452.

[8] van Rooij, I., Guest, O., Adolfi, F. G., de Haan, R., Kolokolova, A., & Rich, P. (in press). Reclaiming AI as a theoretical tool for cognitive science. Computational Brain & Behavior.

[9] Bommasani, R., et al. (2021). On the opportunities and risks of foundation models. Stanford CRFM report.

This work is licensed under CC BY 4.0 
