Publication:
Guided Proof Search Using Large Language Models and Lemma Extraction in Coq

Date

2024-06-12

Published Version

Citation

Prasad, Tarun. 2024. Guided Proof Search Using Large Language Models and Lemma Extraction in Coq. Bachelor's thesis, Harvard University Engineering and Applied Sciences.

Abstract

Interactive theorem provers are powerful tools for formalizing mathematics and verifying the correctness of software. However, because writing formal proofs is tedious, they require significant background and effort to use, and they have not seen widespread adoption among either mathematicians or software engineers. Automated theorem provers aim to address this problem by automating the search for proofs, reducing the human effort required. Recent advances in machine learning have shown that large language models can generate high-quality outputs in a variety of domains, including formal proofs. Most previous approaches that use language models, however, have focused on generating individual proof steps and combining them with an expensive search algorithm to find proofs. In 2023, First et al. introduced Baldur, a system that uses language models to generate entire proofs at once, rather than step by step, for theorems in the Isabelle proof assistant. This thesis studies the feasibility of a similar whole-proof generation procedure for the Coq proof assistant and introduces a novel approach to automated theorem proving that recursively extracts lemmas at failure points in the proof generation process, allowing the system to break complex theorems down into simpler subproblems. We evaluate these approaches on a dataset of 724 theorems from the Software Foundations textbook and show that GPT-4 can generate whole proofs for 66.44% of the theorems. When augmented with our lemma extraction method, GPT-4 improves by 19.54% to reach a success rate of 79.42%, marginally outperforming CoqHammer, a state-of-the-art automated reasoning tool, which proves 78.73% of the theorems. We also evaluate the much smaller open-source model Phind CodeLlama, which shows a 103.23% improvement over its baseline when using lemma extraction.
We release our Coq playground, which contains an implementation of this procedure, along with the dataset and evaluation results, in an open-source repository to encourage further research in this area.
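The recursive lemma-extraction loop described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: `generate`, `check`, and the toy theorem names are hypothetical stand-ins for the real LLM calls and Coq proof checker.

```python
from typing import Callable, Optional

def prove(theorem: str,
          generate: Callable[[str], str],
          check: Callable[[str, str], Optional[str]],
          depth: int = 0,
          max_depth: int = 3) -> bool:
    """Attempt a whole proof; on failure, extract the failing subgoal
    as a lemma, prove it recursively, then retry the original theorem."""
    proof = generate(theorem)            # whole-proof generation in one shot
    failing = check(theorem, proof)      # None means the proof checks
    if failing is None:
        return True
    if depth >= max_depth:               # bound the recursion depth
        return False
    # Prove the extracted lemma first, then retry with it available.
    return (prove(failing, generate, check, depth + 1, max_depth)
            and prove(theorem, generate, check, depth + 1, max_depth))

# Toy stand-ins: each theorem checks only once its listed dependency is proved.
DEPS = {"main_thm": "helper_lemma", "helper_lemma": None}
proved: set = set()

def toy_generate(thm: str) -> str:
    return f"Proof of {thm}."            # placeholder for an LLM call

def toy_check(thm: str, proof: str) -> Optional[str]:
    dep = DEPS.get(thm)
    if dep is not None and dep not in proved:
        return dep                       # failure point: the missing lemma
    proved.add(thm)                      # checker accepts; record the result
    return None

print(prove("main_thm", toy_generate, toy_check))  # True
```

In this sketch the first attempt at `main_thm` fails, `helper_lemma` is extracted and proved, and the retried top-level proof then succeeds, mirroring how lemma extraction breaks a complex theorem into simpler subproblems.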

Keywords

automated theorem proving, formal proofs, formal verification, interactive theorem proving, large language models, proof assistants, Computer science, Mathematics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
