Publication: DafnyBench
Open/View Files
Date
Authors
Published Version
Published Version
Journal Title
Journal ISSN
Volume Title
Publisher
Citation
Abstract
It is said that “program testing can be used to show the presence of bugs, but never to show their absence”. As Large Language Models are increasingly used to generate code, it is important that we verify the absence of bugs in their outputs. For this reason, we introduce DafnyBench, a novel benchmark for evaluating Large Language Models' abilities to write Dafny code, which is a verification-aware programming language. With over 1,000 sample programs, DafnyBench represents the largest benchmark of its kind to date. We argue this benchmark represents an important piece of infrastructure for an emerging field: AI-assisted formal verification. The benefits of having Large Language Models write verification-aware code cuts in two directions: the Dafny verifier provides a rich stream of feedback and clear learning signal for what constitutes correct code with respect to a set of pre- and post- conditions (helping machine learning research efforts), and also greater prevalence of Dafny code generally leads to a higher amount of safe software in the world.