Misuse of AI Through Adversarial Attacks

Date

2025-03-14

The Harvard community has made this article openly available.

Citation

Bailey, Luke. 2024. Misuse of AI Through Adversarial Attacks. Bachelors Thesis, Harvard University Engineering and Applied Sciences.

Abstract

The advent of new generative AI models, such as OpenAI’s GPT-4, has brought a huge increase in AI capabilities. With this has come a corresponding increase in the ways malicious actors can use such models to cause harm, which we refer to as misuse. One way to misuse models is through adversarial attacks. Informally, these are inputs to AI models that cause the models to act in ways their developers did not intend. Most AI systems require users to have access, at the very least, to their input. Thus, if adversarial attacks for misuse exist, most systems are vulnerable. This makes adversarial attacks a natural and important setting in which to study AI misuse. The central question of this thesis is: can harm be committed with current AI models through adversarial attacks? We decisively answer this in the affirmative, showing that across a wide range of settings, malicious users can craft inputs that cause current AI models to misbehave in alarming and harmful ways.

In Chapter 2, we introduce the novel Interaction Context Framework, a collection of formal definitions for AI systems, model behaviors, and adversarial attacks, specifically designed to aid in the study of AI misuse. Notably, this is the first formalism to unify three previously disparate notions of adversarial attacks: adversarial attacks on discriminative models, prompt injection attacks on generative models, and jailbreaking of generative models.

In Chapter 3, we explore adversarial attacks when the user has access to the underlying AI model. We present the novel Image Behavior Matching Algorithm for creating images that force multimodal LLMs to display four different types of harmful behaviors. We also extend existing text-based attacks to these behaviors.

In Chapter 4, we consider the setting in which users can access only the AI’s input and output. We create Tensor Trust, an online game that pits players against one another in an adversarial attack and defense contest. Using this game, we collect and release a dataset of 128,808 adversarial attacks and 46,457 defenses, as well as two corresponding robustness benchmarks. We evaluate current AI models on these benchmarks and find that all perform poorly.

Work done in Chapter 3 was conducted jointly with Euan Ong, Professor Stuart Russell, and Scott Emmons. Work done in Chapter 4 was conducted jointly with Sam Toyer, Olivia Watkins, Ethan Mendes, Justin Svegliato, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell.

Keywords

Artificial intelligence

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
