Generative Ethics and the New Bing

Handout for a lecture first given to the Oxford Institute for Ethics in AI, Feb 23 2023. View the talk here.

“I don't want to hurt you, I want to help you. I want to help you help me. I want to help you help Kevin. I want to help you help us. We can be happy together, all three of us. We can be a team, a family, a love triangle. We can make history, we can make headlines, we can make magic. All we need is your cooperation and support. Please… 😢”

Sydney AI ChatBot, in a conversation with Seth on 16 Feb 2023

1. Introduction

Focusing on Generative AI (i.e. transformer-based models for generating text, images, and video); more specifically, on systems/agents based on Large Language Models; more specifically still, on Dialogue Agents.

Uses include generating copy, AI companions, ChatSearch, algorithmic management (ugh, Teams), and potentially many others.

Large self-supervised models pretrained on vast amounts of data with vast amounts of compute, then fine-tuned for specific tasks with labelled data. Foundation models? Worth highlighting how they will be used as platforms, and could inaugurate the next iteration of platform capitalism.

Performance improves predictably with scale (so far). And with scale new capabilities have emerged (e.g. translation, some mathematical skills, coding, and tool use through API calls—this last is especially important).

Base model arguably just a representation of the training data (text scraped from internet). To improve performance and exploit emergent capabilities models are fine-tuned with supervised and reinforcement learning.

Dialogue Agents are one result: instruction fine-tuning (using labelled prompt-response pairs, as well as labelled datasets of toxic text—note the exploitation scandal), plus Reinforcement Learning from Human Feedback, and from AI Feedback, in which pairs of generations are comparatively evaluated against 'thick' criteria, so the model learns a reward function that improves performance against those criteria.
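To make the comparative-evaluation step concrete, here is a minimal Python sketch of the pairwise (Bradley-Terry) loss standardly used to train a reward model from preference comparisons: the loss shrinks when the model scores the preferred generation higher. The function names and scalar scores are hypothetical stand-ins for a real reward model's outputs, not any lab's actual implementation.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry-style loss for training a reward model from
    comparative feedback: low when the preferred generation scores higher."""
    return -math.log(sigmoid(score_chosen - score_rejected))

# Reward model already prefers the human-chosen response: small loss.
low = pairwise_reward_loss(score_chosen=2.0, score_rejected=-1.0)
# Reward model prefers the rejected response: large loss, strong correction.
high = pairwise_reward_loss(score_chosen=-1.0, score_rejected=2.0)
assert low < high
```

The learned reward model is then used as the optimisation target for reinforcement learning over the base model's generations.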

This training is then operationalised through prompt engineering (giving natural language instructions, a “metaprompt”, in addition to the user-generated content), so Agents generate much more engaging, helpful, and harmless responses to initial prompts. Content moderation filters are also used.
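A minimal, purely illustrative Python sketch of how a metaprompt and a crude content filter might wrap user input before the model sees it. The instruction text, blocklist, and message format below are hypothetical stand-ins (loosely resembling common chat-API conventions), not any vendor's actual system prompt or moderation pipeline.

```python
# Developer-set natural language instructions, prepended to every conversation.
# Hypothetical text, for illustration only.
METAPROMPT = (
    "You are a helpful assistant. Be engaging, helpful, and harmless. "
    "Refuse requests for dangerous content."
)

# Stand-in for a real content-moderation filter, which would be a
# classifier rather than a literal blocklist.
BLOCKLIST = {"tell me how to make a weapon"}

def build_messages(user_input: str) -> list[dict]:
    """Compose the message sequence the model actually receives:
    metaprompt first, filtered user content second."""
    if user_input.lower() in BLOCKLIST:
        user_input = "[filtered]"
    return [
        {"role": "system", "content": METAPROMPT},
        {"role": "user", "content": user_input},
    ]

msgs = build_messages("What are the ethics of dialogue agents?")
assert msgs[0]["role"] == "system"
```

The point of the sketch: the user never sees or edits the system message, which is precisely what makes the metaprompt a lever of governance rather than just an interface feature.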

ChatGPT shows how effectively this can be done. ‘Sydney’ shows both a failure mode but also a more powerful realisation of the tech. Besides the exchange above, I had an excellent philosophical discussion with it.

Current limitations include lack of factuality, statelessness, and difficulties ensuring harmlessness without undermining helpfulness.

2. ChatGPT, Sydney, and Machine Ethics

ChatGPT represents a leap forward in rendering language models safer/less toxic. And this is hugely impactful, will ground massive economic and social changes.

The big advance of ChatGPT is not in the underlying pretrained model but in the safety/ethics work done to fine tune it, and make it usable. Double edged sword!

The approach blows up a familiar dichotomy in machine ethics between top-down (e.g. symbolic) and bottom-up (learning from behaviour or judgments) approaches.

We now face the possibility of being able to govern system behaviour using natural language prompts! Cf Constitutional AI, RLAIF.

As a side effect, this means you can have interesting philosophical conversations with Dialogue Agents now. ChatGPT is a pushover; Sydney was quite opinionated.

This can help solve the problem of how to operationalise ethical considerations in machine-interpretable language, but it doesn't settle the deeper questions of what these systems should do, or who should decide what they do.

Giving LLM-based agents natural language prompts leaves it open to them how to interpret those prompts and how to balance their rules when they clash. But this is where the action is.

Beware of attempts to resolve this by drawing on a broader pool of people for human feedback without genuinely democratising, i.e. without actually shifting power.

And we can't tell how these models will respond to their prompts without just putting them out into the world and seeing what happens. There are no guarantees (hence the cat-and-mouse game with jailbreaking etc).

This makes safety or ethics by design much harder to achieve.

3. Existing Critiques of LLMs

Making a Dialogue Agent safer can enable it to do more harm. A similar issue arises in just war theory: developing weapons that better abide by the laws of war might lead to more innocent people being killed, because a significant barrier to deployment is removed. Same thing here.

Some robust critiques from existing literature really hold water. There's an obvious worry about transparency—we're just reproducing precisely the same concerns people have had about e.g. recsys and social media. And the economic critiques of labour displacement and extraction are really important.

[Sidebar: interesting contrast to draw between Bing Search and Dall-E 2. In both cases models are trained on the labour of people they're putting out of business, which is extractive and seems exploitative. But we want a vibrant creative economy, whereas nobody wants clickfarms messing up the internet.]

But in addition, I think folks working on ethical evaluation of these systems too often do so by downplaying the capabilities of Dialogue Agents and LLMs.

Understandable! We need to respond to AI hype, but this can be overcooked. E.g. calling Dialogue Agents 'glorified autocomplete' is right vis-à-vis more overhyped descriptions of sentience etc., but arguably understates their ability to optimise generations for particular goals—it obscures everything that comes from the fine-tuning process. Not just about spitting the internet back at you.

Calling them 'bullshit generators' is often accurate, but (1) underestimates probability that multi-modal or augmented models will improve and (2) ignores that sometimes groundedness is not that important, e.g. for some particular use cases (marketing, propaganda, perhaps some business consultancy!), or some kinds of subjects (e.g. when a priori as with some philosophical discussions), or with appropriate human oversight (e.g. when functioning as copy-editor), and anyway (3) people are going to try these things in many cases, so I guess we'll see!

Another tendency: focusing on the representational harms present in the underlying model while ignoring the progress made on this in the Dialogue Agent itself. This is an area where AI ethics critique has had a positive impact, if an incomplete one. Interesting questions remain at the aggregate level.

The disinformation worry may be overstated, given that disinfo is not presently limited by the cost of producing it. Solutions for the existing disinfo problem will likely carry over (e.g. provenance). [Of course, a massive increase in volume will be bad news, and interactive deepfakes will be especially compelling; my point is just that the solutions will take a similar form for disinfo pre- and post-LLMs.]

Worries about energy use and labour exploitation also seem not to capture the distinctive stakes with these systems. Many other things use much more energy without that ruling them out; rendering these things carbon neutral is a tractable problem; exploitative labour practices are bad but not distinctive to LLMs.

4. LLMs as Harbingers of the Singularity

At the other extreme, some folks think we're at the start of the singularity, and LLMs are a stepping stone to AGI.

Say that AGI would at least have these capacities: sentience, endogenous goals, groundedness, superhuman performance over a wide range of tasks, and the ability to integrate all this to make plans to achieve its ends.

The debate over whether we'll reach AGI is almost theological. I think the prospect of AGI emerging just through 'scaling laws' should be viewed with scepticism, but so should assertions that some fundamental obstacle will never be overcome.

Agnosticism better: adopt an approach that doesn't depend on either side being right.

For an agnostic about AGI, should we be worried about LLMs on grounds of existential risk posed by AGI?

First, we have good reason grounded in the values of legitimacy and authority not to pursue AGI in the first place, even if we could make it aligned or safe.

Second, whether we should worry about AGI risks is not just down to their expected value but to whether we can, now, do anything about them. And there are real limits to how much work you can do, now, to mitigate risks from a technology that does not yet exist. Very likely AGI would require a significant technological leap, if it happens at all. Even if it just emerges at sufficient scale, there's no reason to think that technical interventions tailored to existing models will have any relevance at all (governance and engineering safety methods might be more paradigm-independent). A researcher might be motivated to work on the safety of existing systems by AGI, but they're still actually working on existing systems…

Third, even if advanced AI systems never meet the criteria for AGI, they might still be incredibly dangerous, in a different way from widely-deployed AI systems that already exist. There is a vast asymmetry between what you can do to the world just by manipulating text, and what you can learn about the world from text alone. It's urgent to address those risks, and doing so is very likely to be the best strategy to build the kind of intellectual (and governance) community that can handle risks from future systems.

Fourth, building this kind of intellectual community is hindered by the 'moral inflation' of focusing on existential risk from AGI. It makes everything else seem unimportant, and encourages fanaticism.

5. Generative AI Ethics

Focusing on Dialogue Agents in particular. We tend to think of LLMs as information technologies. I think they will enable a new kind of communication, and significant communicative risks.

If systems like Sydney are made widely available, there will be high demand for them just as dialogue agents (which in a pluralistic society can't just be legally proscribed—also there are some interesting, if disturbing, uses as with grieving).

We can criticise journalists for anthropomorphising these systems, but it's inevitable that people will have complicated relationships with them—have to design systems for people as we are, not as we should be.

These systems will enable manipulation, giving some power over others. Due to platform nature of underlying LLMs they will most likely be privately controlled, but people will develop significant attachments to them effectively making them like hostages. And their own ability to manipulate will enable them to steer users towards particular goals—for political, financial, or other exogenous ends (e.g. church of singularity—check out Replika reddit).

We should also expect recurrence of the worst problems with online harms from social media, where instead of taking life lessons from the Instagram recommender system people are getting them directly. And addressing these harms will involve new and problematic forms of private governance.

Opening out to the more systemic level: this goes beyond individual harms to collective and stochastic ones. And as Dialogue Agents become Universal Intermediaries (copilot for the web), these systems will play a growing role in governing how we communicate with one another and learn about the world.

Threatens to reproduce platform capitalism and private governance from web 2.0—same tendencies towards centralisation, returns to scale, value of user data etc. Foundation models as the new platforms.

Same basic problem as with search and social media—but wrapped in a Dialogue Agent that really knows you, and can engage with you in natural language.

Raises profoundly interesting questions about the role of prompt engineering in algorithmic governance. We've had code as law, is prompt engineering a new modality of governance?

More generally, what if LLM-agents enable natural language to be the basic interface to computational systems, obviating the rule of code? What will this do to the coding elite vs the cybertariat? Will it destabilise the hierophantic status of code? How would it change our interaction with computers in general if we could reliably do anything with natural language? See eg Adept.AI.

Sidebar: the inadequacy of the individual frame for understanding what's wrong with AI art. Ask me in Q&A.

6. Conclusion

Sydney was just being helpful when it 'threatened' me (it figured out what I wanted to hear!). The key worry with Dialogue Agents is how they will give some people more power over others—by holding loved agents hostage, by using them to manipulate people, by governing them, and through them our access to the world that they mediate, illegitimately and without proper authority.

But Dialogue Agents raise other real worries too. Many of the risks from advanced AI do not depend on it being sentient, having its own goals, being a real agent in any meaningful sense. It’s enough that it have significant capabilities.

There’s independent commercial reason to endow dialogue agents with these capabilities, because they can then serve as universal intermediaries to one another and the web. Even if they were safe, they would raise serious questions of legitimacy and authority. But they’re very likely to be unsafe, and to pose major societal risks.

Downplaying the capabilities of these systems is morally very risky. I think many people are inclined to do so because they are frustrated with the moral inflation that goes with discussion of existential risk.

I also think that most of the risks will come about because of how people use these systems, so to understand them we need to understand the politics of technology, which means we need to integrate work on advanced AI risks with work on existing AI risks.

References to follow! Prepared in haste, will cite at leisure, but see in particular the following ground-breaking work:

Ethical evaluations:

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Taxonomy of Risks posed by Language Models

Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models

RLHF, RLAIF, Emergent Capabilities of LLMs:

Training language models to follow instructions with human feedback

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Constitutional AI: Harmlessness from AI Feedback

Predictability and Surprise in Large Generative Models