Publication:

Automated and Flexible Stress-Testing for the Robustification of Large Language Model Systems

Date

2025-06-24

Citation

Tang, Leonard. 2024. Automated and Flexible Stress-Testing for the Robustification of Large Language Model Systems. Bachelors Thesis, Harvard University Engineering and Applied Sciences.

Abstract

Large language models are incredibly powerful but incredibly brittle and unreliable computing objects. The same models that can convincingly generate rap lyrics in the style of Kanye West also struggle to consistently perform basic arithmetic accurately. The same models that can pass the interview bar at Amazon also struggle to ground their responses in facts. These models are a walking contradiction. As these systems become more embedded into our everyday lives, the key question becomes if we can properly trust and robustify these systems. The perspective that we adopt in this thesis is that, in order to prevent these systems from failing in high-stakes settings, we must preemptively discover all the ways in which they can fail. That is, we desire very powerful evaluation, red-teaming, and stress-testing technologies and tooling for language models. To this end, this thesis introduces the project of automated and flexible stress-testing and develops 1) REALM, a comprehensive robustness benchmark and publicly hosted leaderboard with twenty-six hosted models in partnership with the biggest machine learning platform company in the world; 2) a suite of three novel and tailored red-teaming algorithms alongside three case studies of their successful application against in-the-wild LLM use cases; and 3) a multiagent stress-testing framework that universally jailbreaks five state of the art large language models that have been specifically finetuned to prevent jailbreaks. It is our hope that the technology developed here can enhance the conversation around safe, responsible, and secure AI development with regards to practical failure modes and correspondingly the methods to prevent them. In particular, with the technology developed here, all language model researchers, developers, users, businesses, and stakeholders will be able to discover failure cases before they arise in production. This brings us one step closer to the dream of robust language model systems.

Keywords

Computer science, Mathematics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
