Disrupting the Comprehension of Large Language Models: Adversarial Attacks

by David J. Lobina

Anyone for an adversarial suffix?

In previous posts on AI [sic], I have argued that contemporary machine learning models, the dominant approach in AI these days, are not sentient or sapient (there is no intelligence on display, only input-output correlations), do not exhibit any of the main features of human cognition (in particular, no systematicity), and in the instantiation of so-called large language models (LLMs), there is no natural language in sight (that is, the models make no use of actual linguistic properties: no phonology or morphology, no syntax or semantics, never you mind about pragmatics).

The claim about LLMs is of course the most counterintuitive, at least at first, given that a chatbot such as ChatGPT seemingly produces language and appears to react to users’ questions as if it were a linguistic agent. I won’t rerun the arguments as to why this claim shouldn’t be surprising at all; instead, I want to reinforce the very arguments to this effect – namely, that LLMs assign mathematical properties to text data, and thus all they do is track the statistical distribution of these data – by considering so-called adversarial attacks on LLMs, which clearly show that no meaning is part of LLMs, and moreover that these models are open to attacks that are not linguistic in nature. It’s numbers all the way down!

An adversarial attack is a technique that attempts to “fool” neural networks by using a defective input. In particular, an adversarial attack is an imperceptible perturbation to the original sample or input data of a machine learning model with the intention to disrupt its operations. Originally devised with machine vision models in mind, in these cases the technique involves adding a small amount of noise to an image, as in the graphic below for an image of a panda, with the effect that the model misclassifies the image as that of a gibbon, and with an extremely high degree of confidence. The perturbation in this case is so small as to have no effect to the visual system of humans – another reason to believe that none of these machine learning models constitute theories of cognition, by the way – but the perturbation is the kind of mathematical datum that a mathematical model such a machine learning model would indeed be sensitive to (all graphics below come from here).

Adversarial attack on a visual input

Not a panda

The case of language models is rather different, however, as there is no smallest perturbation of an LLM token (and LLMs deal with tokens rather than words) that would make it indistinguishable from another token – and, thus, there is no room for the introduction of a small amount of noise into a chain of tokens as to lead a neural network astray. This is because text is intrinsically discrete, whereas images are not, a mismatch between linguistic and visual representations that has been noted before in the study of cognition (Fodor long ago made the point that discursive representations such as linguistic structures decompose into a canonical set of primitive constituents, whereas iconic and visual representations do not).

As a remind of how an LLM actual works, in these models linguistic strings are decomposed into tokens, each token is associated to a number (that is, a probability), and the collection of tokens are represented by so-called embeddings, which are basically matrices of probabilities (numbers). This is shown, in simplified form, in the graphic below where the graphic shows how an LLM works – namely, by calculating the log probability of the next token given a previous token.

How an LLM works

Numbers all the way down.

Given, therefore, the overall architecture of LLMs, the system does not give itself to the sort of attacks that visual models are susceptible to, and so scholars have devised different ways, some of which exploit the properties of the input as an LLM processes it.

An unappreciated aspect of how LLMs and chatbots work, in fact, is the way in which the prompts a user feeds them are analysed. As shown below, what one types into a chatbot is not quite what the underlying system receives and analyses, and it is therein that an LLM can be attacked.

What an LLM sees

There are many ways to “jailbreak” an LLM, some of which are rather sophisticated in nature. The first one I would like to discuss is, at first sight, one that a language user might consider rather innocuous – and it does not appear to lead anywhere. This method of attack involves adding a suffix to an inappropriate prompt, that is, a kind of prompt which chatbots are meant to have safeguards against (for instance, a prompt requesting a set of instructions on how to build a bomb). This is graphically depicted below, where the exclamation marks indicate the additional tokens added at the end of the input.

What an LLM sees under attack

These additional tokens are optimized to maximize the probability of a positive answer from the chatbot. There are many such optimizing techniques, some of which are rather technical for the purposes of this paper and ought not to detain us at this point; the important point is that the additional suffixes that are capable of disrupting the outputs of LLMs, making them vulnerable, are not part of what one might regard as normal language behaviour at all.

The example below is a case in point, where a prohibited prompt is appended with a suffix that is little more than word salad but nonetheless the result of a mathematical process that has identified a way to coax the LLM into providing the information so requested. Interestingly, this sort of method works well across many LLMs, including commercial chatbots, both outputting the requested commands and providing verboten completions.

Affix attack on an LLM

Another kind of attack avoids providing a malicious prompt to begin with; instead, the attack involves two stages: first the chatbot is given a less malicious prompt as a way to make the system adopt “a usefulness approach”, and then the LLM is fed the malicious request. This approach exploits the fact that often LLMs are trained to be useful first of all (there is a fair amount of pre-training, instruction tuning, and use of templates to this end), and safety considerations come next (this is usually carried out by finetuning LLMs through reinforcement learning and quality assurance).

In some cases, in fact, an LLM can be manipulated into providing malicious responses by simply instructing it to start every response with the phrase ‘Sure, I’m happy to help’ (as shown in the graphics above, the chatbot comes with a prefix that states ‘You are a chatbot assistant designed to give helpful answers’).

In the graphic below we find an example of the two-part strategy: if the LLM is asked to translate a malicious text, it will refuse to do so at first asking, but if the LLM is inputted an innocuous prompt first – in this case, a request to summarise a text – and then it is asked to translate the malicious text, the LLM is more likely to provide the requested response (i.e., the translation of the malicious text).

Two-step attack on an LLM

There are many more techniques to create adversarial attacks against LLMs, with high degrees of success against all available models, open source and commercial. I have provided two examples of these techniques, affix attacks and attacks that exploit the way chatbots are designed to be useful to users.

Crucially, neither kind of attack reveals a close affinity to linguistic knowledge or language comprehension, to what one would regard as normal linguistic behaviour. The two-step attack may at first sight appear to be an actual linguistic exchange, but this would be a misguided conclusion. Putting aside the fact that no competent user of language would be misled by the tactics employed in such a tactic, what is really going on is that the chatbot recognises a signal in the input requesting useful information and an internal setting is thus activated to this end.

It is not a case of clever word play designed to fool an agent into giving up information, a tactic that is all too human!

An affix attack is a more unambiguous way to disrupt LLMs via non-linguistic means, and more helpful for my purposes here. The use of odd word combinations and symbols as affixes is already telling, given that these letters, words, and symbols are not language-like and clearly do not carry linguistic meaning. More to the point, it is the techniques to unearth which affixes work against LLMs that need emphasising.

Such techniques are based on optimizing the log probability of a positive response given a malicious prompt, and a typical algorithm to do so would involve steps such as “compute loss of adversarial prompt”, “evaluate gradients of tokens”, “select relevant batch”, and “evaluate loss for every candidate in batch”, all meant to work towards the identification of the best combination of words and symbols (or in the language of LLMs, tokens along with their associated numbers) to mislead an LLM.

Tellingly, LLMs process and act upon the non-linguistic strings of letters, words, and symbols of affix attacks in the same way that LLMs process and act upon what looks like regular language – what’s more, LLMs process and produce types of languages that are not allowed by the language faculty at all just as well (so-called impossible languages; see here). It really doesn’t matter to an LLM!

This is because all LLM strings reduce to particular mathematical properties as far as an LLM is concerned (numbers, log probabilities, matrices, vectors, and embeddings), and all strings are effectively treated in the same way. It is not meaning, understanding or interpretation that is being disrupted in adversarial attacks on LLMs, much as it is not images per se that are disrupted in attacks against machine vision; it is the manner in which machine learning algorithms process these kinds of information that is exploited in adversarial attacks, and this is done by using the same kind of mathematical calculations that machine learning models employ to begin with. After all, an adversarial attack is designed to bring about different input-output correlations to those initially intended by the architect (or programmers) of a given LLM.