Publication: Improving Profiling Techniques for Pipeline Parallelism in Distributed AI Model Training
Abstract
As the capabilities of AI models continue to grow exponentially, so do their scale and complexity. Training such large models is costly in both memory consumption and runtime, so optimizing training across GPU clusters becomes more important as model size increases. Given the number of GPUs these models require, the financial cost is a significant burden. Because models differ, choosing the most efficient training algorithm for each one is essential to minimizing this cost. This problem is so complex that modern training still involves numerous costly experiments to determine the training parameters that best balance memory consumption, runtime, and financial cost. For large models, parameter tuning alone can cost hundreds of thousands of dollars, not to mention the time needed to devise and run the experiments. Here, we provide an efficient, accurate, and nonintrusive memory and runtime estimation tool that works with the most advanced training algorithms available today. Because the cost of parameter experiments comes from the many partial profiling runs of the model that they require, an accurate memory and runtime estimation tool that does not need to run the model is extremely valuable. Such a tool allows us to choose the optimal training algorithm for a given model while eliminating the costly parameter search. Prior work has implemented such estimation only for a limited subset of models, specifically for limited types of parallelism. Our solution therefore provides an extended memory and runtime profiling tool that supports pipeline parallelism, a technique used in many state-of-the-art training algorithms.