Publication: Guided Proof Search Using Large Language Models and Lemma Extraction in Coq
Date
2024-06-12
The Harvard community has made this article openly available.
Citation
Prasad, Tarun. 2024. Guided Proof Search Using Large Language Models and Lemma Extraction in Coq. Bachelor's thesis, Harvard University Engineering and Applied Sciences.
Abstract
Interactive theorem provers are powerful tools for formalizing mathematics and verifying the correctness of software. However, because writing formal proofs is tedious and demands significant background knowledge, they have not seen widespread adoption among either mathematicians or software engineers. Automated theorem provers aim to address this problem by automating the search for proofs, reducing the human effort required. Recent advances in machine learning have shown that large language models can generate high-quality outputs in a variety of domains, including formal proofs. Most previous approaches that use language models, however, have focused on generating individual proof steps and combining them with an expensive search algorithm to find proofs. In 2023, First et al. introduced Baldur, a system that uses language models to generate entire proofs at once, rather than step by step, for theorems in the Isabelle proof assistant.
This thesis studies the feasibility of a similar whole-proof generation procedure for the Coq proof assistant and introduces a novel approach to automated theorem proving that recursively extracts lemmas at failure points in the proof generation process, allowing the system to break complex theorems down into simpler subproblems. We evaluate these approaches on a dataset of 724 theorems from the Software Foundations textbook and show that GPT-4 can generate whole proofs for 66.44% of the theorems. Additionally, when augmented with our lemma extraction method, GPT-4 sees a 19.54% relative improvement, achieving a success rate of 79.42% and thus marginally outperforming CoqHammer—a state-of-the-art automated reasoning tool—which proves 78.73% of the theorems. We also evaluate the much smaller open-source model Phind CodeLlama, which shows a 103.23% improvement over its baseline when utilizing lemma extraction. We release our Coq playground, which contains an implementation of this procedure, along with the dataset and evaluation results through an open-source repository to encourage further research in this area.
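The recursive lemma-extraction procedure summarized above can be sketched, at a very high level, as the following toy Python model. Everything here is illustrative: the function names (`prove`, `toy_llm`, `toy_checker`), the failure-reporting interface, and the control flow are assumptions made for exposition, and the stubs stand in for the actual LLM calls and the Coq proof checker used in the thesis.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CheckResult:
    ok: bool
    failing_goal: Optional[str] = None  # goal at the first failure point, if any

def prove(goal, llm, checker, depth=0, max_depth=3):
    """Attempt a whole proof of `goal`; if checking fails, extract the goal at
    the failure point as a lemma, prove it recursively, and retry with the
    proved lemma in context. Returns (lemmas, proof) on success, else None."""
    lemmas = []
    while True:
        proof = llm(goal, lemmas)                  # whole-proof generation
        result = checker(goal, proof, lemmas)      # e.g. run Coq on the script
        if result.ok:
            return lemmas, proof
        if depth >= max_depth or result.failing_goal is None:
            return None                            # give up on this branch
        sub = prove(result.failing_goal, llm, checker, depth + 1, max_depth)
        if sub is None:
            return None
        lemmas.append((result.failing_goal, sub))  # retry with new lemma

# Toy stubs: "main" checks only once a lemma for "sub" is available.
def toy_llm(goal, lemmas):
    return f"proof-of-{goal}"

def toy_checker(goal, proof, lemmas):
    if goal == "main" and "sub" not in [g for g, _ in lemmas]:
        return CheckResult(False, failing_goal="sub")
    return CheckResult(True)
```

On the toy stubs, `prove("main", toy_llm, toy_checker)` first fails, recursively proves the extracted goal `"sub"`, and then succeeds with that lemma in context, mirroring the decomposition into simpler subproblems described in the abstract.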
Keywords
automated theorem proving, formal proofs, formal verification, interactive theorem proving, large language models, proof assistants, Computer science, Mathematics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service