Publication:

Improving Profiling Techniques for Pipeline Parallelism in Distributed AI Model Training

Date

2025-05-16

Citation

Boulware, Sophie. 2025. Improving Profiling Techniques for Pipeline Parallelism in Distributed AI Model Training. Bachelor's thesis, Harvard University Engineering and Applied Sciences.

Abstract

As the capabilities of AI models continue to grow exponentially, so do their scale and complexity. Training such large models is costly in both memory consumption and runtime, so optimizing training across GPU clusters becomes more important as model size increases. Given the number of GPUs these models require, the financial cost is a significant burden. Because models differ, choosing the most efficient training algorithm for each one is essential to minimizing this cost. The problem is complex enough that modern training still relies on numerous costly experiments to find training parameters that balance memory consumption, runtime, and financial cost. For large models, parameter tuning alone can cost hundreds of thousands of dollars, not to mention the time needed to devise and run the experiments. Here, we provide an efficient, accurate, and unintrusive memory and runtime estimation tool that works with the most advanced training algorithms available today. Because the cost of parameter experiments comes from the many partial profiling runs of the model that they require, an accurate memory and runtime estimation tool that does not need to run the model is extremely valuable. It allows us to choose the optimal training algorithm for a given model while eliminating the costly parameter search. Prior work has implemented this only for a limited subset of models and for limited types of parallelism. Our solution therefore extends memory and runtime profiling to support pipeline parallelism, a technique used in many state-of-the-art training algorithms.

Keywords

Computer science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
