
Comparison with expandable_segments in pytorch/c10?

Open YouJiacheng opened this issue 2 years ago • 3 comments

https://github.com/pytorch/pytorch/pull/96995

https://github.com/pytorch/pytorch/blob/95a86ed9ca107329151e0dc172386d50dd3471c6/c10/cuda/CUDACachingAllocator.cpp#L311-L324

The expandable_segments:True option is used to enable/disable this behavior. We use CUDA's low-level memory APIs, which are similar to mmap, to extend the memory segments. These APIs separate the allocation of physical memory (cuMemCreate) from the allocation of virtual address space (cuMemAddressReserve) and the association between them (cuMemMap/cuMemSetAccess).

When we allocate a new segment, we allocate enough address space to map basically the entire physical memory of the GPU (there is 256TiB of address space), but we only map enough physical memory to handle the current amount of memory needed by the program. As more is requested, we add more physical memory to the segment. This can work at the granularity of GPU pages which are 2MiB currently.
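The reserve-then-commit pattern described above can be sketched in miniature. This is a toy model (not PyTorch's implementation): `reserve` stands in for cuMemAddressReserve, and `commit` stands in for cuMemCreate + cuMemMap + cuMemSetAccess; all names and sizes here are illustrative.

```python
# Toy model of an expandable segment: reserve a large virtual range up
# front, then back it with "physical" pages only as demand grows.
# PAGE = 2 MiB, matching the GPU page granularity mentioned above.
PAGE = 2 * 1024 * 1024

class ExpandableSegment:
    def __init__(self, reserve_bytes):
        # cuMemAddressReserve analogue: address space only, no memory yet.
        self.reserved = reserve_bytes
        self.committed = 0  # bytes actually backed by physical pages

    def alloc(self, nbytes):
        """Grow the committed region in page granules to cover nbytes more."""
        needed = self.committed + nbytes
        # cuMemCreate + cuMemMap + cuMemSetAccess analogue, one granule
        # at a time, rounded up to the 2 MiB page size.
        while self.committed < needed:
            if self.committed + PAGE > self.reserved:
                raise MemoryError("reserved address space exhausted")
            self.committed += PAGE

# Reserve 1 GiB of address space, but commit only what allocations need.
seg = ExpandableSegment(reserve_bytes=1 << 30)
seg.alloc(5 * 1024 * 1024)              # 5 MiB requested
print(seg.committed // PAGE)            # → 3 (three 2 MiB pages committed)
```

The key point is that growing `committed` never moves or copies existing data: the virtual addresses were fixed at reserve time, so the segment can expand in place.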

YouJiacheng avatar Jan 02 '24 12:01 YouJiacheng

Thank you for your interest in our work. GMLake was implemented before April 2023. It was originally built on PyTorch 1.13.1; after PyTorch 2.0 was released, we adapted it to that version, and all of our experiments were conducted on PyTorch 2.0. However, expandable_segments was only introduced in version 2.1, so we have not yet conducted more detailed experiments with this feature. In recent days, we have carried out an in-depth investigation of the implementation of expandable_segments. As mentioned in the code comments, this feature primarily addresses the issue of growing segment size, whereas we address the problem of fragmentation, which is not the same. We have since adapted our work to PyTorch 2.1 and ran a simple comparative test on this feature. On the GPT-NeoX-20B model, the memory utilization of expandable_segments was 87%, while for GMLake it was 95%. expandable_segments is very good work, and we plan to conduct a detailed analysis of this feature on a variety of models.

If you would like to have a deep talk, please leave an email address, and we will send you our contact information.

ruizhang1230 avatar Jan 08 '24 06:01 ruizhang1230

Thank you for your informative reply. I believe GMLake and expandable_segments are concurrent works, though the mentioned PR introducing expandable_segments is dated Mar 17, 2023 (it was only released in 2.1).

The purpose of growing segments should be to eliminate fragmentation. Theoretically, there can be no external fragmentation (only intra-page fragmentation) with expandable_segments: tensors can always be allocated as long as there are enough spare pages, regardless of whether those pages are physically contiguous.
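The claim above can be illustrated with a toy simulation (hypothetical, for illustration only): with free memory split into two non-contiguous runs of 5 pages each, a contiguous allocator cannot satisfy an 8-page request, while a page-granular allocator in the style of expandable_segments can map any 8 spare physical pages into a contiguous virtual range.

```python
def contiguous_alloc(free_map, need):
    """Succeed only if `need` physically contiguous free pages exist."""
    run = 0
    for page_free in free_map:
        run = run + 1 if page_free else 0
        if run >= need:
            return True
    return False

def paged_alloc(free_map, need):
    """Succeed if `need` free pages exist anywhere: virtual remapping
    makes physical contiguity irrelevant."""
    return sum(free_map) >= need

# 12 physical pages; pages 5-6 are pinned by live tensors, splitting
# the 10 free pages into two non-contiguous runs of 5.
free_map = [True] * 5 + [False] * 2 + [True] * 5

print(contiguous_alloc(free_map, 8))  # → False: no contiguous run of 8
print(paged_alloc(free_map, 8))       # → True: 10 spare pages in total
```

This is the sense in which page-level mapping removes external fragmentation: an allocation fails only when total spare pages run out, never because free memory is scattered.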

Stitching is performed naturally, since the allocation of physical memory is separated from the allocation of virtual address space and the mapping between them.

YouJiacheng avatar Jan 08 '24 09:01 YouJiacheng

The techniques behind both should be roughly the same: manually managing the mapping between virtual and physical memory.

eedalong avatar Feb 18 '24 07:02 eedalong