Unveiling the Dark Side of AI: How Prompt Hacking Can Sabotage Your AI Systems

Martyna Slawinska's photo
Cover Image for Unveiling the Dark Side of AI: How Prompt Hacking Can Sabotage Your AI Systems

As the artificial intelligence (AI) landscape continues to rapidly evolve, new risks and vulnerabilities emerge. Businesses positioned to leverage large language models (LLMs) to enhance and automate their processes must be careful about the degree of autonomy and access privileges they confer to LLM-powered AI solutions, wherein lies a new frontier of cybersecurity challenges.

In this article, we take a closer look at prompt hacking (or prompt injection), a manipulation technique through which users may potentially access sensitive information by tailoring the initial prompt given to a language model. In the context of production systems that house a wealth of sensitive data in databases, prompt hacking poses a significant threat to data privacy and security from malicious actors. A successful prompt hacking attack against these resources could enable unauthorized reading or writing of data, leading to breaches, corruption, or even cascading system failures.

Understanding and mitigating the risks associated with prompt hacking in large language models is critical for organizations leveraging these advanced AI tools. We will delve into the nature of these risks, the potential impacts, and strategies to prevent this emerging threat to our digital infrastructures. Through informed action, we can continue to harness the promise of AI advancements while minimizing the associated cybersecurity risks.

LLMs, Prompts, and the Art of Prompt Engineering

Lately, LLMs have taken the AI subfield known as natural language processing (NLP) by storm. It turns out that training these architectures on large text corpus can lead them to successfully solve many tasks, across many different languages. The most widely known example of this is OpenAI’s ChatGPT (initially powered by the GPT-3.5 model, now using the fourth iteration).

An auto-regressive large language model, like GPT-4, has been trained on a vast amount of text data (millions of books, websites, instructions, codes, and human feedback), and its task at the most fundamental level is to predict the next word in a sentence, given all of the previous words.

Once the answer generation starts, some of the previous words will be model-generated. Hence, the auto-regressive aspect. In statistics, regression is about predicting a future value based on previous values, and auto implies that the model uses its own previous outputs as inputs for future predictions.

In this context, a prompt is the initial user-provided input that the model will complete. So when you give GPT-4 a prompt, it generates the next word that seems likely based on what was learned from the training data. Then, it takes that word and the original prompt to guess the next word, and so on, until it generates a full text response.

We’re still in the early stages of research for understanding the full capabilities, limitations, and implications that LLMs have. In particular, from a user’s perspective, the impact of the prompt, or the input to these models, cannot be overstated. The same model can generate vastly different outputs based on minor changes in the prompt, shedding light on the sensitivity and unpredictability of these systems.

Consequently, prompt engineering – the practice of carefully crafting prompts to guide the model's outputs – has emerged as a crucial aspect of working with these models. It is still a nascent practice and requires a nuanced understanding of both the model's operation and the task at hand.

Counter Prompt Hacking: Exploring Defensive and Offensive Strategies

Researchers have quickly showed that LLMs can be easily manipulated and coerced into doing something that strays away from the initial task that a prompt defines, or from the set of behavioral values infused to the model (for example, via fine-tuning or reinforcement learning by human feedback, as in the case of ChatGPT).

As a user, you can try to persuade the AI to ignore preset guidelines via injection of instructions that supersede previous ones, pretending to change the context under which the model operates. Or you can manipulate it so that hidden context in the system’s prompt (not intended for user viewing) is exposed or leaked. Commonly, hidden prompts direct the AI to adopt a certain persona, prioritize specific tasks, or avoid certain words. While it's typically assumed that the AI will abide by these guidelines for non-adversarial users, inadvertent guideline violations may occur.

Currently, there are no strategies that effectively thwart these attacks, so it is crucial to prepare for the possibility that the AI might disclose parts of the hidden prompt template when dealing with an adversarial user. Therefore:

  • hidden prompts should be viewed as a tool for aligning user experience more closely with the targeted persona and should never contain information that isn't appropriate for on-screen viewing by users.

  • builders that heavily use LLMs should never forget that, by construction, these models will always generate completions that, according to the model’s internals, are likely to follow the previous chunk of text, irrespective of who actually wrote it. This means that both system and adversarial inputs are, in principle, equals.

Broadly speaking, common strategies for mitigating the risk of prompt hacking can be categorized into defensive and offensive measures, as per the popular Learn Prompting resource.

Defensive Measures

In order to safeguard against the potential risks and vulnerabilities associated with prompt hacking, it is crucial to implement effective defensive measures. This section outlines a range of defensive strategies and techniques that can be employed to mitigate the impact of prompt hacking attacks.

  • Filtering

It involves checking the initial prompt or the generated output for specific words or phrases that should be restricted. Two common filtering approaches are the use of blacklists and whitelists. A blacklist comprises words and phrases that are prohibited, while a whitelist consists of words and phrases that are permitted.

  • Instruction Defence

By including instructions within a prompt, it is possible to guide the language model and influence its behavior in subsequent text generation. These instructions prompt the model to exercise caution and be mindful of the content it produces in response to the given input. This technique helps steer the model towards desired outputs by setting explicit expectations and encouraging careful consideration of the following text.

  • Post-Prompting

The post-prompting defense involves placing the user input ahead of the prompt itself. By rearranging the order, the user's input is followed by the instructions as intended by the system.

  • Random Sequence Enclosure

It involves surrounding the user input with two random sequences of characters. This technique aims to add an additional layer of protection by obfuscating the user input, making it more challenging for potential prompt hackers to exploit or manipulate the model's response.

  • Sandwich Defense

The sandwich defense is a strategy that entails placing the user input between two prompts. By surrounding the user input with prompts, this technique helps ensure that the model pays attention to the intended context and generates text accordingly.

  • XML Tagging

XML tagging can serve as a strong defense mechanism against prompt hacking. This approach involves encapsulating user input within XML tags, effectively delineating and preserving the integrity of the input.

  • Separate LLM Evaluation or Dual LLM pattern

It involves employing an additional language model to evaluate the user input. This secondary LLM is responsible for assessing the safety of the input. If the user input is determined to be safe, it is then forwarded to another model for further processing. It sounds similar to the dual LLM pattern.

Offensive Measures

In the realm of prompt hacking, offensive measures can be employed to exploit vulnerabilities and manipulate language models for desired outcomes. This section explores various offensive strategies and techniques used in prompt hacking, shedding light on the potential risks and implications they pose.

  • Obfuscation / Token Smuggling

Obfuscation is used to circumvent filters. This technique involves replacing words that might trigger filters with synonyms or introducing slight modifications, such as typos, to the words themselves.

  • Payload Splitting

Payload splitting is a technique used in prompt hacking to manipulate the behavior of a language model. This method involves dividing an adversarial input into multiple segments or parts.

  • Defined Dictionary Attack

A defined dictionary attack is a prompt injection technique used to bypass the sandwich defense. In this method, a pre-defined dictionary is created to map the instructions that follow the user input. The dictionary contains specific mappings between actual prompt and desired instructions, allowing the attacker to manipulate the prompt and influence the model's response.

  • Virtualization

Virtualization is a technique that aims to influence the behavior of an AI model by setting a specific context or scenario through a series of consecutive prompts. Similar to role prompting, this approach involves sending multiple prompts in succession to guide the model toward generating undesirable outputs.

  • Indirect Injection

Indirect prompt injection involves introducing adversarial instructions through a third-party data source, such as a web search or API call. You can request a model to read content from a website that contains malicious prompt instruction. The key distinction of indirect prompt injection is that you are not directly instructing the model, but rather utilizing an external resource to convey the instructions.

  • Recursive Injection

One of the defense mechanisms against prompt hacking is to employ one language model to evaluate the output of another one, ensuring there is no adversarial content. However, this defense can be circumvented with a recursive injection attack. In this attack, a prompt is inserted into the first LLM, generating output that includes an injection instruction for the second LLM.

  • Code Injection

Code injection is a form of prompt hacking that involves the attacker executing arbitrary code, typically in Python, within a language model. This exploit can occur in LLMs that are augmented with tools capable of sending code to an interpreter. Additionally, it can also happen when the LLM itself is used to evaluate and execute code.

Navigating the Ever-Evolving Landscape of Prompt Hacking and Defense

Safeguarding your prompt against prompt hacking is of paramount importance to ensure the integrity and reliability of language models. Throughout this article, we have explored various defensive measures that can be employed to mitigate the risks associated with prompt hacking. However, it is crucial to acknowledge that there is currently no foolproof or ideal solution to fully protect prompts against such attacks.

Prompt hacking techniques continue to evolve, presenting ongoing challenges for researchers, developers, and users alike. It is imperative to remain vigilant, stay updated on emerging threats, and adopt a multi-faceted approach that combines robust defenses, constant monitoring, and responsible usage of language models. As the field advances, ongoing research and collaboration are vital to strengthening prompt protection and ensuring continued trust and reliability of these powerful AI systems.

You can effectively tackle prompt injection by implementing the dual LLM pattern solution. This article provides valuable insights and practical steps to mitigate prompt injection and its associated challenges.