Ulrike Hahn

There’s been intense discussion across the fediverse, GitHub, blogs, and articles about a bridge that would let you use a Mastodon account to follow people on Bluesky, see their posts, reply, like and repost them, and vice versa. It’s an exciting prospect: there is quality content on Bluesky, and it feels in the spirit of an open social web that connects people without restriction by platform (particularly corporate-owned platforms). For some, this quickly elevated the bridge to a decisive step for the future of decentralised online social media and even for the ‘future of the internet’ itself.

Prompted partly by the fact that it was first floated as opt-out, there was a wave of backlash arguing that the bridge violated user consent and potentially endangered vulnerable groups (as one can only opt out of something one actually knows exists). This was countered by arguments claiming that opt-out consent was sufficient (“the fediverse is built on opt-out consent”), that it was already given by virtue of signing up and publicly posting, or that it simply wasn’t relevant (“if you want privacy, don’t post on the internet”).

Little of the discussion involved GDPR, the EU’s General Data Protection Regulation, so this post is my attempt to work through the issues raised by the bridge, consent, and GDPR: why it’s tricky and why one should care.

What follows are some more or less connected thoughts on what social media for science could and should be. There are excellent articulations of what a social media future for science might look like, such as the multiple articles and blogs by Bjoern Brembs. This is not that! Instead, I’m trying to articulate for myself some constraints, tensions, and roadblocks to such a future. My hope is that deeper discussion of those can help move us forward.

Metaphors are hugely important both to how we think about things and how we structure debate, as a long research tradition within cognitive science attests [1]. Metaphors, as tools, can make us think better about an issue, but they can also lead us astray, depending on what relevant characteristics metaphors make clear and what they obscure. The notion that large language models (LLMs) are, in effect, “stochastic parrots” currently plays a central role in debate on LLMs. What follows are my thoughts on ways in which the metaphor is (now) creating confusion and hindering progress.

This means what follows is as much a comment on that debate as it is on the metaphor itself. It is worth stressing that I take it to be the very function of debate to help us examine our ideas and thoughts in order to improve them. Identifying potential weaknesses or errors in what is being said is consequently not meant to disparage; it is an integral part of debate doing exactly what it is meant to do, on all sides.

First off then, what is the stochastic parrot metaphor? According to Bender and colleagues (2021),

Text generated by an LM is not grounded in communicative intent, any model of the world, or any model of the reader’s state of mind. It can’t have been, because the training data never included sharing thoughts with a listener, nor does the machine have the ability to do that.

In short, LLMs lack, either partly or wholly, “situationally embedded meaning”. In line with this, I take the phrase “stochastic parrot” to make salient three main things. Like the ‘speech’ of a parrot, the output of LLMs 1) involves repetition without understanding, 2) albeit with some probabilistic, generative component, and, in that, it is 3) very much unlike what humans do or produce.

To the extent that the phrase draws people’s attention to the fact that LLMs are not ‘people’ and that something about their workings might involve transition probabilities (say, ‘next token prediction’), the metaphor seems both useful and undoubtedly effective.

Beyond that, however, I now see it giving rise to the following problems in the wider debate:

  1. confusion between what’s ‘in the head’ and ‘in the world’

  2. a false sense of confidence regarding how LLMs and human cognition ‘work’

  3. an illusion of explanatory depth

  4. a misdirection of evaluative effort

  5. a misdirection of discussion about risks and harms

I will give examples of each in turn.

Confusion between what’s ‘in the head’ and ‘in the world’

The academic literature distinguishes multiple aspects of ‘meaning’ in human language and communication (see Appendix below). Debate continues on their nature, their relative importance, their interrelationship, and the explanatory adequacy of the distinctions scholars have drawn.

However, one fundamental feature of human language, simply as an observational fact, is that it is a conventional system. Words (and sentences) ‘have meaning’ because the speakers of a language use them in certain ways (not necessarily all in exactly the same way, just with sufficient overlap for there to be a convention, see e.g., [2]). In the words of the philosopher Hilary Putnam: in some aspects at least, “meaning ain’t in the head” [3].

Consequently, when Polly the pet parrot says “Polly has a biscuit” that (in some sense) ‘means’ something, and it can be true or false, regardless of whether Polly *herself* has any idea whatsoever what those sounds she produces ‘mean’, let alone a concept of ‘language’, ‘meaning’, or ‘truth’.

This follows simply from the fact that this aspect of meaning doesn’t rest on any single head, artificial or otherwise, but rather on the practice of a community. And whether “Polly has a biscuit” is true depends not on Polly’s grasp of human language, but on whether she actually has a biscuit.

This makes it wrong, in my view, to claim that the

“tendency of human interlocutors to impute meaning where there is none can mislead both NLP researchers and the general public into taking synthetic text as meaningful”

Bender, Gebru, McMillan-Major, & Shmitchell (2021)

Specifically, it seemingly equates decoding of meaning with the decoding of *intention*, overlooking the component that rests on decoding of *convention*. The conventional aspects of meaning can make both Polly’s noises and synthetic text ‘meaningful’, and even true.

This is not just an arcane and esoteric point about natural language meaning, or about LLMs. It is central to the very concept (and value) of “computational system”. My pocket calculator doesn’t ‘have a grasp of meaning’. It doesn’t ‘understand’ that 2+2 = 4. But that doesn’t stop it being useful. That utility ultimately rests on there being a semantic mapping somewhere; the calculator would be of no use if 2 and 4 didn’t ‘mean’ something, at least to me. But that doesn’t require that mapping to be internal to the calculator or in any way accessible to it. It simply isn’t something the calculator itself has to ‘know’.

A false sense of confidence regarding how LLMs and human cognition ‘work’

By making salient the contrast between parrot (mindless repetition) and human (genuine communication), the metaphor suggests that we know that what LLMs do is very much not what humans do.

This obscures several basic facts. It obscures that LLMs are based on a human-inspired computing paradigm: LLMs descend from a paradigm loosely modelled on highly stylised neurons, and that paradigm was born out of attempts to understand human cognition as a form of computation.

The framework of cognitive science, which tries to understand human thought as ‘computation’ or information processing, itself exists as a discipline precisely because we *do not* understand (fully) how human language or thought actually work. The multiple questions associated with understanding ‘meaning’ (see Appendix below) extend to other seemingly fundamental notions such as ‘understanding’, ‘reasoning’ or what it means ‘to know’.

And, finally, it obscures that we very much don’t know how LLMs ‘work’ either. “Next token prediction” is part of the ‘stochastic’ part of the parrot. From many online conversations, my impression is that there is a tendency to overlook the distinction between the task (“predict the next token”) and whatever generative model an LLM forms through training that allows it to fulfil that task, and to take the mapping from the former to the latter to be more straightforward, and more restrictive, than it is.

An LLM’s behaviour will rest on structure implicit in its input and the representations of this that the system forms during training. Taking the parrot metaphor seriously, by contrast, suggests a characterisation of LLM performance as “repetition with a bit of noise”. That, after all, is what we take real parrots to do.

This may reinforce a mistaken view of ‘training’ or learning as consisting largely of ‘storage’ of previously encountered items. This, in turn, I suspect, fuels claims that the performance of a model like GPT-4 on a benchmark human performance test such as passing the bar exam is unremarkable and simply reflects regurgitation of material already available in the input (even where this is demonstrably not possible, as in the essay part of the bar exam, where the questions post-date the end of model training [4]).

That perception is, of course, also at odds with the extent to which models such as GPT-4 confabulate answers. Neither ‘just repetition’ nor ‘next token prediction’ suffices to explain the production of a made-up reference to a fictitious author, with a fictitious title, complete with fake DOI, as this reference was never *in the input* nor (for the same reason) actually ever “the most likely next token” in any straightforward way.

Yet the parrot-like lack of understanding and the emphasis on ‘next token prediction’ are routinely invoked in current online discourse to explain both these successes and failures, more or less at the same time.

The illusion of explanatory depth

This itself might be seen as a manifestation of a more general failing, namely that the ‘stochastic parrots’ metaphor and, with it, the appeal to a lack of “situationally embedded meaning” give rise to an illusion of explanatory depth.

In fact, arguably not a single computational system we have built has had “access to situationally embedded meaning” in the sense Bender et al. describe above. This ranges from any simple script or computer programme I have ever written and run (functional or not), through basic computational devices such as pocket calculators, through a wide range of now essential systems such as electronic databases, to computational systems that, by whatever design approach, manage to far exceed aspects of human performance, whether these be weather-forecasting programmes, IBM’s Deep Blue, or AlphaGo. None of them has “access to situationally embedded meaning”.

That means “situationally embedded meaning” is a zero-variance predictor vis-à-vis the entire gamut of capability and utility computational systems have exhibited. By virtue of that, it can neither explain nor predict anything about the performance range of those systems.

In light of that, it seems hard to sustain the notion that it explains anything specifically about LLMs themselves either.

A misdirection of evaluative effort

This arguably has a knock-on effect. “Lack of situationally embedded meaning” has been widely picked up not just as an explanation of the limitations of LLMs; it is also presented as an in-principle restriction on what they can do.

This, falsely in my view, suggests that we know something about the possible behaviour of such systems, without having to look in any detail at their actual behaviour and performance.

It is, of course, the case that there are systems for which we can confidently gauge behaviour solely by virtue of some fundamental characteristics: for any object, if it weighs more than the water it displaces, I can appeal to that simple fact and don’t need to examine in detail whether it will float.

However, for that to work, we at least need some kind of causal connection between the feature and the relevant behaviour. That is a hard case to make for access to meaning and LLMs, given the zero correlation just outlined.

To understand what LLMs can do, what they can do well, and where they fail, we have to look at and evaluate the behaviour of actual systems. Those capabilities are an empirical question. They vary across LLMs to date, and those varying capabilities in turn determine what useful functions such systems could perform. None of that work can be short-circuited by an in-principle consideration, let alone one about ‘meaning’ and ‘understanding’.

A misdirection of discussion about risks and harms

If one takes the lack of ‘situationally embedded meaning’ to fundamentally restrict what a computational system can do, then it might also make sense to take that fact to limit what harms such a system could do now or in future.

It should, by now, be clear that ‘lack of situationally embedded meaning’ patently does not (in my view) sufficiently restrict function for that argument to go through.

Because ‘lack of situationally embedded meaning’ explains none of the variance across extant computational systems, it also doesn’t strike me as a meaningful predictor of future performance. Hence it runs the risk of obscuring what data on system performance we actually have to address this question.

For example, there is an inductive argument to be made that there is additional cause for concern, beyond present risks and harms, based on the empirically observed, rapid improvement of performance as a function of increases in scale in language models to date [5]. The fact that there are discontinuities in the emergence of capabilities as we have increased that scale does not undermine that point; rather, it emphasises the uncertainty (high variance) in predicting individual capabilities. This further underscores the limits of an appeal to properties of ‘meaning’ as a gauge of future performance.

If, by contrast, one takes the notion that LLMs are ‘stochastic parrots’ to cap whatever capabilities such systems could ever develop at roughly current levels, then it does make sense to worry mostly about current risks. It then might also make sense to consider one of the main problems with LLMs to be that people dangerously over-estimate LLM capabilities.

It is right to point out that user expectations and understanding of a system are important safety considerations. In keeping with that, anthropomorphising a system might indeed be a concern. However, it strikes me as debatable whether that concern justifies the effort with which commentators have sought to clamp down on expressions such as LLMs “hallucinating”, on the grounds that those expressions might foster anthropomorphisation.

It is true that humans have a tendency to anthropomorphise technologies: it’s as if the printer knows I’m in a rush. Yet that doesn’t seem to pose widespread, significant problems in human interactions with devices whereby we expect them to do things wildly beyond their actual capabilities: I’ve never asked my printer to come along for a drink. I, like others, have managed to form a decent understanding of what printers can and cannot do, and that rests, in good part, on the reliability of their observed behaviour: that is, the things my printer does and doesn’t do. Whatever problems I do have with my printer ultimately stem from its failure to function sufficiently reliably. The printer not printing is the problem, not my (limited) anthropomorphisation.

Maybe LLMs will be different for people, and maybe using words such as “hallucination” will make that worse, as opposed to providing a powerful signal that LLM output is sometimes wildly unreliable. But that, too, is an empirical question.

Conclusion

Whether one cares more about current, or more about potential future problems, both or neither, is a value judgment. The extent to which absence of situationally embedded meaning restricts future performance, and hence risk, by contrast, is a causal, empirical claim.

It is an empirical issue what LLMs can do, and it is an empirical issue how they (or human beings) actually work, and what role situationally embedded meaning might play in that. The ‘stochastic parrots’ metaphor conveys something about an otherwise complex and opaque bit of technology, and to that extent, it has been helpful.

But my impression is that it is now a red herring that misleads and distracts. It blocks and derails conversation unintentionally, by pointing our thoughts in the wrong direction if we care about how these systems work and what they can do. Even worse, I think it now also functions to block conversation intentionally, with increasingly exasperated restatements (i.e., “they are just stochastic parrots”, why don’t you get that?).

I think our discourse around LLMs would improve if we shifted our focus. So I would suggest that we put the metaphor to rest, at least for a bit.

References

[1] Lakoff, G., & Johnson, M. (2008). Metaphors we live by. University of Chicago Press.

[2] Labov, W. (1973). The boundaries of words and their meanings. New ways of analyzing variation in English.

[3] Putnam, H. (1975). The meaning of “meaning”. Philosophical Papers, Vol. 2: Mind, Language and Reality, 215-271.

[4] Katz, Daniel Martin and Bommarito, Michael James and Gao, Shang and Arredondo, Pablo, GPT-4 Passes the Bar Exam (March 15, 2023). Available at SSRN: https://ssrn.com/abstract=4389233

[5] Bowman, S. (2023) Eight things to know about language models. https://cims.nyu.edu/~sbowman/eightthings.pdf

Appendix: Aspects of Meaning

‘Meaning’ is studied across multiple disciplines (philosophy, psychology, linguistics…) with divergent perspectives and inconsistent use of terms. The above are aspects of meaning that I would take most people to be happy to distinguish in principle or, at least, as meriting consideration. There is, however, wide-ranging disagreement about what any of these things consist of exactly, how separate they are (is there really a boundary between ‘conceptual’ and ‘encyclopaedic’ knowledge, or a boundary between semantics and pragmatics?), to what extent they are just idealisations, how important they are to language and communication, which are most basic, and so on.

with Michael Maes & Davide Grossi

1. Arguments about adding search to Mastodon

Mastodon presently does not support full-text search: it is not possible to search for words that are not accompanied by a hashtag (#), that is, it is not possible to search for words that have not been intentionally made available for search. Users (particularly those migrating from Twitter) regularly lament this absence, leading to calls for its inclusion.

However, the absence of full-text search has, to date, been a conscious design choice. For example, Mastodon’s founder and lead developer Eugen Rochko noted in 2017:

“If text search is ever implemented, it should be limited to your home timeline/mentions only. Lack of full-text search on general content is intentional, due to negative social dynamics of it in other networks.”

In keeping with this, the Mastodon Project currently supports only limited search functionality:

“Mastodon’s full-text search allows logged in users to find results from their own statuses, their mentions, their favourites, and their bookmarks. It deliberately does not allow searching for arbitrary strings in the entire database.”

At issue are multiple underlying concerns, from protecting marginalised groups from intrusion and harassment to consciously anti-viral design (on Mastodon’s anti-viral design more generally, see Thompson, 2022).

The absence of unrestricted text search is contested through regular requests for its inclusion made in posts on Mastodon, through specific GitHub feature requests, but also through attempts simply to circumvent these restrictions via alternative software tools.

Arguments for unrestricted text search typically appeal to individual freedom or individual ‘rights’ as a rationale. In keeping with this individual-focussed perspective, much of the discussion on ameliorating the impact of search focusses on consent. While opt-in to discoverability as a design choice clearly addresses some concerns about search, it does not address the fundamental issue that “negative social dynamics” (real or imagined) are *system-level properties* of the online discourse as a whole. This means they may potentially impact all users in some way, whether they consent or not. This raises multiple questions about the relationship between individuals on the platform, between individuals and the collective, and platform governance. We return to these implications below; the main goal of this blog post, however, is to make such talk of ‘system-level properties’ non-mysterious in order to help promote better thinking and discussion about the design of online communication systems. For this, we present a toy example designed to provide some basic insights into online communities as ‘complex systems’ and how they might be altered by a feature such as text search.

2. Online communication as a complex system

Complex systems are systems characterised by large numbers of interactions between their (often simple) components that give rise to emergent properties of the system as a whole that are typically difficult to predict (e.g., Ladyman et al., 2013).

Communicating individuals form social networks (such as the sample network seen in the figure below) in which individuals are nodes and their communication paths are represented by links between those nodes. Even though each individual may have direct connections with only a handful of other individuals, the interconnections between individuals may link them, collectively, into a much, much larger network. In this way, individual Mastodon users are linked not only to direct followers, but indirectly to those followers’ followers and so on.

Understanding online communication platforms as complex systems consequently involves trying to understand how information is propagated across such networks and what kinds of emergent patterns this may give rise to. For example, although it is individuals that read, write, and boost posts, we may think of their interactions as combining to determine characteristics of the discourse across the network as a whole. Is the discourse predominantly friendly or hostile? Do focal topics that attract widespread interest persist, or is attention fleeting and fragmented? What kinds of community norms govern interactions?

To understand questions like these, researchers make use of computer simulations involving agent-based models (ABMs). Such models allow one to explore the behaviour of systems in ways that we could never do in real life. Figure 1 shows the interface of a simple model created to illustrate some basic aspects of online communication systems (a link to the model itself and instructions for further, hands-on, exploration can be found below).

Fig. 1. Interface of the Agent Based Modelling software NetLogo with a social network.

The simulation creates a simple social network. It then simulates a contagion process across that network. Such models are widely used to study opinion dynamics or the spread of behaviour. They simply appropriate the notion of ‘being infected’ for receiving a message or witnessing a behaviour. Depending on what we are trying to model, it may make sense to think of the process of spread as either a ‘simple’ or a ‘complex contagion’. For a simple contagion, a single exposure is enough for ‘infection’. For a complex contagion, multiple exposures are required. This can be modelled, for example, by setting a threshold: an individual only becomes ‘infected’ once the number of infected neighbours reaches the specified threshold. This is typically more appropriate than a simple contagion for modelling the adoption of behaviours (e.g., Guilbeault et al., 2018).

The threshold value, like the number of agents in the network or the structure of that network, is a ‘parameter’: an attribute of the configuration of the model. The goal of agent-based simulations is to gain insight into the patterns that emerge from agent interactions and to understand how those patterns depend on the values of the model’s parameters.

For our toy example, let’s now think of the ‘infection’ as something like “anger”. Anger may be socially mediated in as much as others’ angry behaviour may make us angry in turn. So we will interpret the model as capturing the spread of anger via a diffusion process on the network: let’s assume, for example, that at each time step in the model, agents communicate a message and the message content reflects their current state as either not angry (black) or angry (red). The model is set up with an initial number of randomly selected ‘angry’ agents. Our interest then lies with the extent to which anger spreads.
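To make this concrete, here is a minimal Python sketch of such a threshold (‘complex contagion’) model. The original model is written in NetLogo; this analogue, including the choice of network generator and the number of neighbours per node, is my own illustrative reconstruction, not the model’s actual code.

```python
import random
import networkx as nx

def run_contagion(n=100, k=4, p_rewire=0.19, n_seeds=3, threshold=2,
                  max_ticks=1000, seed=None):
    """Threshold ('complex') contagion of 'anger' on a small-world network."""
    rng = random.Random(seed)
    # n and p_rewire mirror the appendix below; k (neighbours per node in the
    # initial ring lattice) is an assumed value, not stated in the post.
    g = nx.watts_strogatz_graph(n, k, p_rewire, seed=rng.randrange(10**9))
    angry = set(rng.sample(list(g.nodes), n_seeds))  # initially 'angry' agents

    for _ in range(max_ticks):
        # An agent turns 'angry' once at least `threshold` of its direct
        # neighbours are angry (threshold=1 gives a simple contagion).
        newly_angry = {a for a in g.nodes if a not in angry
                       and sum(nb in angry for nb in g.neighbors(a)) >= threshold}
        if not newly_angry:
            break  # no further change is possible: the run has stabilised
        angry |= newly_angry
    return len(angry) / n  # final proportion of 'angry' agents
```

With threshold=1 (a simple contagion) a few seeds suffice to reach every agent on a connected network; with threshold=2 the spread often stalls, which is exactly the contrast discussed next.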

If we set the contagion threshold to 1, that is, to a simple contagion, exploring our simple model reveals that anger will eventually spread to all members of the network, regardless of the exact number of initially angry agents, the size of the network, or its structure.

This is no longer true if we raise the contagion threshold to 2, so that an individual agent becomes ‘angry’ only when 2 of its direct neighbours are themselves ‘angry’. This means that ‘anger’ requires sufficient support among the direct neighbours of an agent: without that support, the contagion will stop in that neighbourhood, and may eventually come to a stop in the network as a whole.

This means also that anything that makes it more likely that multiple neighbours become infected will change the dynamics of spread. This includes increasing the proportion of initially infected agents. But it also includes changes to the structure of the network, for example, increasing the number of neighbours an agent has (i.e., increasing the number of links), or increasing the clustering whereby an agent’s neighbours are themselves neighbours. All of these now matter, and will interact to produce the patterns of spread, making that spread increasingly difficult to predict.

This brings us back to the topic of this blog post: unrestricted text search. How might we think about the effects of adding search? In effect, what search does is (dynamically and temporarily) rewire the network. Instead of seeing posts only from those we follow, we (and others!) can now see posts from arbitrary individuals as a function of message content.

So what happens in our toy model if we introduce some limited rewiring as a result of a hypothetical ‘search’? On a given time step there is now a small probability (p=.02) that a randomly selected node receives three additional random links as a result of a ‘search’. All else about the model stays the same. Yet this simple addition dramatically changes the behaviour of the system. The complex contagion (with the familiar threshold = 2) now stands a decent chance of, once again, reaching all agents in the network. Figure 2 below shows the outcome of 1000 runs of the model (each with a different random starting configuration) both with and without this rewiring (‘search’). The x-axis shows the respective proportions of ‘angry’ agents in the population, and the y-axis shows the count of how many simulated networks ended up with that proportion.
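A sketch of the ‘search’ variant and of the repeated runs summarised in Figure 2 might look as follows. Again, this is my own illustrative reconstruction, reusing the sketch above; for simplicity, links added by a ‘search’ persist for the rest of a run.

```python
def run_contagion_with_search(n=100, k=4, p_rewire=0.19, n_seeds=3, threshold=2,
                              p_search=0.02, n_new_links=3, max_ticks=1000,
                              seed=None):
    rng = random.Random(seed)
    g = nx.watts_strogatz_graph(n, k, p_rewire, seed=rng.randrange(10**9))
    angry = set(rng.sample(list(g.nodes), n_seeds))

    for _ in range(max_ticks):
        # With small probability a 'search' rewires the network: a randomly
        # selected node gains three additional random links.
        if rng.random() < p_search:
            node = rng.choice(list(g.nodes))
            for other in rng.sample([m for m in g.nodes if m != node], n_new_links):
                g.add_edge(node, other)
        newly_angry = {a for a in g.nodes if a not in angry
                       and sum(nb in angry for nb in g.neighbors(a)) >= threshold}
        angry |= newly_angry
    return len(angry) / n

# Distribution of outcomes over many random starting configurations,
# with and without 'search' (cf. Figure 2):
with_search = [run_contagion_with_search(seed=s) for s in range(1000)]
without_search = [run_contagion(seed=s) for s in range(1000)]
```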

The plot reveals several important features of our model, and with it of complex systems more generally:

  1. the system’s behaviour varies as a result of random factors (e.g., which agents are randomly selected for initial contagion, and which are randomly selected for ‘search’).

  2. as a result, there is no single answer as to “what happens” (e.g., when we introduce ‘search’). Rather, there is a space of possible outcomes, some of which are more probable than others.

  3. there are discontinuities in the space of possible outcomes: there are *no* model runs that end with 60% of agents angry. Rather, once a large enough subset is reached, all agents are eventually reached.

  4. a minor change to the model can lead to very different outcomes: no run without ‘search’ saw spread to all agents.

  5. the impact of ‘search’ is not restricted to the agents that are rewired. On those runs where all agents eventually become angry, the vast majority end up in a different state than they would have without that intervention.

These points are echoed in Fig. 3, which provides further insight into the relevant mechanisms. Our stylised ‘search’ adds links to the network. To examine the impact of this, the plots show how the final proportion of ‘angry’ agents correlates with the final number of links and with the number of ‘searches’ that took place (right-hand panels), with the top panels (green) showing the results of the 1000 ‘no search’ runs and the bottom panels (blue) showing the results of the 1000 ‘search’ runs.

We can again see clearly how much variability there is, even under exactly the same parameter settings. We can also see that increasing numbers of searches (and, with them, additional links) increase the chances of spread, but that correlation is loose.

3. Establishing an evidence base?

So what does this simple model tell us? It is nothing like real communication or the real scale of a network such as the one comprising all Mastodon users, and our ‘search’ and ‘anger’ are nothing like real search or real anger.

The latter are just labels; we could equally have chosen ‘happiness’, or a wholly uninformative label such as ‘gleeb’. The meaning of the labels we chose to describe the model is exhausted by what they actually represent: state changes in a particular social network model. What we really want to know is what would happen *on Mastodon* if we introduced unrestricted text search.

This toy model cannot tell us that. It can still tell us plenty that is relevant, however. All of the general characteristics 1-5 above apply to complex systems more generally. Making a model more realistic will not generally make these features go away.

This means also that there are limits on what any kind of real-world empirical study could tell us. Even if we could run ‘experiments’ on Mastodon (or some other platform) that allowed us to look at the effects of introducing search, an individual ‘run’ of our real world network remains just a single point in a space of possible outcomes —outcomes that could have been different given random variation.

That space of possible outcomes likely contains non-linearities and phase transitions (Sole et al., 1996). So even if we combine our best methods (experiments, simulations, observational studies) we will likely understand only something about the broad directions in which changing a parameter might push the system. It will remain the case that ‘just a little bit more’ could lead to a qualitatively wholly different outcome. And the interactions of multiple parameters will be even more resistant to our understanding.

While this means that we will not likely ever get a definitive answer on what adding search would do to Mastodon, it does shed light on some of the arguments and intuitions in that debate. First and foremost, the idea of restricting search as part of an anti-viral design stance is plausible, in the sense that it is plausible that search and the degree to which things may go viral are connected. Second, examination of even our toy model as a complex system shows that one cannot conclude that because a bit of something had little to no impact, a bit more will continue to be innocuous: #hashtag search is already search, but that doesn’t mean that adding more won’t radically transform the system. A bit more can become radically different, so friction matters as a factor dampening the rate at which individuals take certain actions, here and elsewhere. Third, ‘minority actions’ can have global, system-wide effects far beyond those directly involved. That may seem mysterious in the absence of consideration of an actual complex system, but our simple example shows that it is not. This means also that “consent” to having one’s own posts included in unrestricted search does not solve all issues: one can’t consent on behalf of others to effects that *they* will incur.

4. Who decides?

The upshot of all of this is that decisions on system design for online communication platforms seem unlikely to occur in a context with an abundance of evidence that clarifies precisely what effects a particular change to the system would have. This makes considerations of who gets to decide, and on what grounds, all the more important.

It is natural to try and cast these issues in terms of ‘rights’: individual rights of users, rights of those who have invested most in building the platform, and so on. But rights are never limitless, because exercising them typically touches on the rights of others, all the more so when we are dealing with public or collective goods. These raise a host of problems of their own (e.g., Réaume, 1988). And it seems extremely unlikely that those problems will magically disappear or resolve by virtue of decentralisation or federation. So Mastodon, as not just a piece of software or a development company but also a community, might ultimately find itself seeking to develop governance structures to resolve such issues.

References

Guilbeault, D., Becker, J., & Centola, D. (2018). Complex contagions: A decade in review. In Lehmann, S. and Ahn, Y.Y. (eds.) Complex spreading phenomena in social systems: Influence and contagion in real-world social networks, Springer. accessed at: https://arxiv.org/pdf/1710.07606.pdf

Ladyman, J., Lambert, J., & Wiesner, K. (2013). What is a complex system? European Journal for Philosophy of Science, 3, 33-67. https://link.springer.com/article/10.1007/s13194-012-0056-8

Réaume, D. (1988). Individuals, Groups, and Rights to Public Goods. The University of Toronto Law Journal, 38(1), 1-27. https://www.jstor.org/stable/825760

Solé, R. V., Manrubia Cuevas, S., Luque, B., Delgado, J., & Bascompte, J. (1996). Phase transitions and complex systems: Simple, nonlinear models capture complex systems at the edge of chaos. https://digital.csic.es/bitstream/10261/44294/1/COMPLEXITY-96.pdf

Thompson, C. (2022) Twitter alternative: how Mastodon is designed to be “antiviral”. Medium https://uxdesign.cc/mastodon-is-antiviral-design-42f090ab8d51

Appendix: Instructions for model exploration

A link to the model is here. Clicking it will open a version of the model for running within a browser. All Netlogo models have three tabs (see Figure 1 above): the Interface, an Info Tab, and a Code Tab. The Interface Tab lets the user run the model via buttons and to-be-entered parameter values. The Info Tab contains a description of the model, and the Code Tab shows the computer programme itself.

Pressing the Setup button will initialise the model, Go Once will execute a single time step, and Go will let the model run until behaviour stabilises so that there is no more change, at which point the run ends.

The contagion threshold is set by the number entered in the “threshold” box. Entering 1 into the “search” box enables search, 0 turns it off (alternatively, pressing the purple Search button will execute ‘search’ once with the pre-defined probability, regardless of setting). The topology parameters N and p determine the size and structure of the network, respectively (the graphs above were all produced with a network of N=100 agents and a rewiring probability p=.19, giving rise to a so-called small-world network; see the Info Tab for more detail).

For more thorough exploration, the model can be downloaded together with NetLogo (which is free) and explored using NetLogo’s inbuilt “BehaviorSpace”, which allows one to define experiments involving many runs (see NetLogo’s documentation for instructions on how to use BehaviorSpace).

(U Hahn, Jan. 2/2023)

There is currently lively debate on Mastodon about whether or not to introduce QTs (quote posts) – a feature that many considered integral to the Twitter experience but that Mastodon has (to date) eschewed on a variety of grounds, most notably concerns about “dunking”, “pile-ons”, and other misuses, but also notions of consent, wanting to promote conversation with people instead of about people, and so on. The point of this blog post is not whether Mastodon should or should not introduce QTs. The point is to consider some arguments on QTs floating around on the platform, and to evaluate them from an argumentation theory perspective. Why that's useful or interesting, I'll say more about at the end; for now, I'll just offer the intuition that basing decisions on good arguments as opposed to bad ones is more likely to lead to good outcomes.

The four arguments (reasons) are:

  1. Blocking QTs won't stop bad behaviour

  2. QTs are being blocked without evidence that they promote harm

  3. We need QTs

  4. QTs will happen anyway, so we should focus on making them safe

I'll go through each of these in turn. All are interesting, and each analysis shows something quite different. Some will be more technical than others (1. and 2.), so if the first two are too geeky, do read on: you might still find the next two more interesting. So, without further ado:

“Blocking QTs won't stop bad behaviour”

The argument is trying to establish a reason for why, causally, blocking QTs is not effective (in plain English, blocking QTs won't do what it’s meant to do).

For any such causal claim we can distinguish between necessary and sufficient conditions. Necessary conditions are ones that must be in place for an effect to occur; sufficient conditions are ones that are enough for an effect to occur.

QTs might or might not be sufficient (on their own) to generate 'bad behaviour', but nobody actually seems to be claiming that only QTs generate bad behaviour or that bad behaviour isn't possible without QTs (and such claims would be fanciful given that Mastodon instances still require moderation!).

In other words, QTs are clearly not necessary for bad behaviour, bad behaviour arises (also) through other means.

Given that everyone arguably understands this, the argument has a whiff of 'straw man' fallacy about it (that is, a spurious attempt to refute a position nobody actually holds, see e.g., Woods et al., 2004).

What is really required is evidence to assess the causal role of QT: to what extent is the ability to QT sufficient for generating bad behaviour? For causal claims, that ideally involves comparisons across cases with and without QT (see e.g., Lagnado et al., 2007).

This brings me to Argument 2 about the presence or absence of evidence.

Before that, though, a shout out to another reason why people might genuinely think Arg. 1 is better than it is, namely that it looks like a type of argument that is really strong ('modus tollens' – this might be too technical for most people's interests, so I've put it in an Appendix at the end). Onwards to Argument 2:

“QTs are being blocked without evidence that they promote harm”

There are two aspects to this argument: a) whether it is factually true (i.e., is there relevant evidence or not?) and b) if there is no such evidence (or only insufficient evidence) is this a good argument and why?

Is there really no evidence?

There is a relevant observation: Mastodon was designed to minimise certain types of bad behaviour, and those types of bad behaviour do indeed seem pretty rare. But what is really required is evidence that compares with and without QTs, and that is lacking, inasmuch as any comparison, say, between Twitter and Mastodon is confounded by many simultaneous changes. Furthermore (non-argumentation theory aside, here...), studying such systems and trying to identify causal effects and their magnitude is really hard, because online communication systems involve many interacting components that may modify each other's effects. Science has rather limited tools, at present, to study such systems (see e.g., Bak-Coleman et al., 2021), and establishing test-bed communities where one could properly conduct such research is a priority for many, but costly and difficult to realise (a CERN model for studying information environments, see e.g., Lewandowsky et al., 2020; Wanless & Shapiro, 2022). So we seem unlikely to have terribly compelling evidence in the short term. This makes the argument extremely relevant. But is it compelling?

Given insufficient evidence, is this a good argument?

The argument is about what argumentation scholars call the “burden of proof”. Burdens of proof are familiar from other contexts in which we ultimately have to make decisions, in particular from law. In law, a failure to meet a burden of proof will lead to dismissal of a case. Argumentation scholars have adopted the notion for many purposes. For example, it has been argued that the proponent of a claim holds the “burden of proof” in as much as it is up to the proponent to provide evidence for a claim when challenged (see e.g., van Eemeren and Grootendorst, 2004). Crucially, that is intended to be an epistemic burden of proof (trying to spell out what it means to 'win' the argument or 'what to believe' in light of the evidence). One can take issue with that idea of an epistemic burden of proof (see e.g., Hahn & Oaksford, 2007), and argue that what we should believe is what is congruent with the evidence (and if that evidence says 'we really don't know', that's exactly what we should believe). Importantly, that's not what Argument 2 is, however. The argument isn't “you don't have evidence, so you shouldn't believe that QT is harmful”; the argument is about what should happen, given that we don't know. With which course of action the burden of proof should lie, and how high that burden should be set, depends on all kinds of value judgments, not just facts and evidence about the world. In the case of criminal law, the burden of proof is set to reflect the fact that society thinks it is worse to wrongly convict an innocent person than to wrongly set a criminal free.

Many different values and value judgments could be invoked to argue about who should hold the burden of proof vis-à-vis QTs and why: you could argue that it's fair that the people who designed, built, and supported the space can require evidence that their intuitions are, in fact, wrong, and that without that, everything stays the same. Or you could argue that online toxicity is so damaging that, unless we have good evidence that a feature doesn't promote it, we should avoid introducing that feature. Or you could argue that most people want QTs (I'm not claiming this is actually true), so we should do what people want in the absence of good reason to the contrary. None of these are arguments about facts (what is true or false). They are all arguments involving values (what we like or dislike, and by how much).

Because the argument isn't providing any reason for why the burden of proof should lie a particular way (i.e., with “blocking” or with “not blocking”), it doesn't address the issue in a real way.

It's, at best, a description of a state of affairs (“I don't see any real evidence, and QTs are being blocked”), and, at worst, an attempt to claim a burden of proof (“in the absence of evidence we should be going ahead”) without actually providing a reason for why that's the way the burden of proof should be (something argumentation scholars are likely to think is unfair).

“We need QTs”

Many versions of this argument, by many individuals and all manner of groups, have appeared in my feed in the two months since I joined here. The salient thing about “need” is its relation to “want”.

An argument for QT, or any other feature, is typically an explicit or implicit argument like this:

“I need X for Y”

There are consequently two needs in play here: whether one really needs X to achieve Y, and whether one needs Y itself.

The former (“need X for”) is about an instrumental relationship. It is an empirical question, and the claim that I “need X for” is false if there are other ways for me to achieve Y.

The latter (“need Y”) is fundamentally about importance, that is, a value judgment. It is false when the claimed importance doesn't match the actual importance to myself. And, if I don't need Y (I merely want Y), then I also don't actually 'need' X, even if X is genuinely the only way to get Y.

Y is also fundamentally about values in a further sense: in almost any interesting real-world context, different needs conflict. (Here, the absence of QT wasn't an oversight; it was an intentional design choice for a particular goal, and so, presumably, reflects a felt “need” to not have QT.) The key underlying issue will thus be how these different competing needs should be resolved in this case, and why. This may include coming to agreement that some lesser feature W, which is less effective at bringing about Y, may nevertheless be preferable given the overall balance of needs.

“I need X for Y” is a potentially powerful argument for QT, but, stated as such, it is also still just a claim. Further arguments or reasons are required to establish that need. So what better arguments legitimately can (and should) focus on is the empirical question about the effectiveness of X for causing Y (see above!), which can include discussion of both necessary and sufficient features, and, at the same time, better arguments can (and should) focus on good reasons for adopting a particular balance of needs.

“QTs will happen anyway, so we should focus on making them safe”

The first thing to note is that, far from being the exception, 'bad things happening anyway' is the normal way for bad things! We have vast legal systems (from civil law about contracts and what happens when people break them, through things like traffic violations, all the way to serious crimes) only because the bad things they try to regulate “happen anyway”. If they never happened we wouldn't need those rules, and if the rules totally stopped them, we wouldn't need to expend all that energy on specifying what happens when they are nevertheless broken, which is the majority of what the rules actually do.

As a result, basically nothing follows from the fact that QT is a potentially bad thing that will happen anyway, because there is nothing special about that fact.

For the vast majority of bad things, the way we try to deal with them is by defining clear rules to stop them and combining those with enforceable sanctions for violations that nevertheless occur. So why should QT be different? That is the argument that needs to be made.

There are, of course, real-world cases where we simultaneously maintain that a behaviour is harmful and sanctioned, and nevertheless make accommodations in practice to further reduce harm: examples might be how we regulate sex work, or the fact that one might ban certain drugs but still offer needle exchanges. These kinds of cases strike me as being characterised by the fact that potential 'perpetrators' (if we ban these behaviours) are simultaneously, in some sense, victims who are harmed. But all of that, and whether it has any bearing on QTs, is beyond the realm of purely evaluating arguments.

The argument evaluation point, here, is simply that nothing much, if anything, follows from “QTs will happen anyway”. As a result, it doesn't add much of a reason for thinking about “safer QTs”.

Thinking about safer QTs is already a project anyone is entitled to engage in, the same way that anyone is entitled to argue that QTs (or newly safer QTs) have benefits that outweigh the costs. Given that this is what any argument about competing needs or desires will ultimately come down to, it seems preferable to just provide arguments directly for that.

That avoids not just misdirection, but also the slight hint of trying to portray the debate as already closed, when argumentation theorists, at least, would still consider all of these issues up for legitimate debate (for more theoretically souped up versions of that sentiment in the form of a 'freedom rule' governing rational discourse, see e.g., van Eemeren and Grootendorst, 2004).

Who cares?

Finally, is there a point to all of this, and who cares? Well, scholars care who are interested in the question of whether we can say things about the quality of arguments that aren't just wholly subjective. Are there things about argument quality one can say that go beyond subjective preferences (I like chocolate, you like vanilla)? Attempts such as the above are attempts to show that there are. The intuition that there should be at least some more objective criteria is widespread. In line with that intuition, people in a wide range of disciplines (education, psychology, critical thinking) care about arguments that seem more convincing than they are (also known as “fallacies”, see e.g., Woods et al., 2004; Hahn, 2020). And, on some level, we probably all do. And while we may not care much if others are wrongly bamboozled by our own arguments, we generally don't want to be bamboozled ourselves. And, finally, we have strong intuitions that bad arguments are unlikely to lead to good decisions; and, conversely, that if something *is* really good there should be good arguments one can find for it (and there are ways to make those intuitions theoretically more precise).

So, focussing our efforts on trying to find good reasons for what we want to do, instead of getting caught up in bad ones, should hopefully benefit us all. And it seems particularly relevant when we are trying to build a community and negotiating what we want to do.

Appendix

Argument 1 might also be confounding people because it looks a lot like a logically valid argument (“modus tollens”):

If the snark is gorped, then the squaggle is glibbed.

The squaggle is not glibbed

therefore, the snark is not gorped
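In standard propositional notation, the same schema reads:

```latex
% Modus tollens: from "if P then Q" and "not-Q", conclude "not-P".
\[
  P \rightarrow Q,\quad \neg Q \;\;\vdash\;\; \neg P
\]
```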

To say that this argument is logically valid is to say that one has to accept its conclusion if one accepts the premises (by necessity, otherwise one's position is self-contradictory), and that this holds regardless of the specific content, just because of its form (which we can see from the fact that it holds even for nonsense words). So, from that perspective, this looks like a really great (maximally great, in fact!) argument:

If we block QTs, we stop bad behaviour.

Bad behaviour isn't stopped

therefore, we don't block QTs

But that would be to misconstrue the argument and the claim – that argument 'works' only if we believe the premise “if we block QTs, we stop bad behaviour”, which is exactly what is being questioned here.

References

Bak-Coleman, J. B., Alfano, M., Barfuss, W., Bergstrom, C. T., Centeno, M. A., Couzin, I. D., ... & Weber, E. U. (2021). Stewardship of global collective behavior. Proceedings of the National Academy of Sciences, 118(27), e2025764118. https://www.pnas.org/doi/10.1073/pnas.2025764118

Eemeren, F. H. van, & Grootendorst, R. (2004). A systematic theory of argumentation: The pragma-dialectical approach. Cambridge: CUP.

Hahn, U. (2020). Argument quality in real world argumentation. Trends in Cognitive Sciences, 24(5), 363-374. https://www.sciencedirect.com/science/article/pii/S1364661320300206

Hahn, U. & Oaksford, M. (2007) The burden of proof and its role in argumentation. Argumentation, 21, 39-61. https://link.springer.com/article/10.1007/s10503-007-9022-6

Lagnado, D. A., Waldmann, M. R., Hagmayer, Y., & Sloman, S. A. (2007). Beyond covariation. Causal learning: Psychology, philosophy, and computation, 154-172. https://books.google.de/books?hl=en&lr=&id=5I4RDAAAQBAJ&oi=fnd&pg=PA154&dq=d+lagnado&ots=OmE23ZHia3&sig=VB9vFUD4Wjqwyl0gsGV_58HBBOg&redir_esc=y#v=onepage&q=d%20lagnado&f=false

Lewandowsky, S., Smillie, L., Garcia, D., Hertwig, R., Weatherall, J., Egidy, S., ... & Leiser, M. (2020). Technology and democracy: Understanding the influence of online technologies on political behaviour and decision-making. https://pure.mpg.de/rest/items/item_3277241/component/file_3277242/content

Wanless & Shapiro (2022) A CERN model for studying information environments https://carnegieendowment.org/2022/11/17/cern-model-for-studying-information-environment-pub-88408

Woods, J., Irvine, A., & Walton, D. N. (2004). Argument: Critical thinking, logic and the fallacies, Revised Edition. Toronto: Prentice Hall.
