
peak memory calculation

Open · vincent-pli opened this issue 4 months ago • 3 comments

I saw these lines in the source code: https://github.com/interestingLSY/swiftLLM/blob/682cf9a28f97f7490409981a2f181528f377eb5d/swiftllm/worker/model.py#L116-L122

After the forward pass, some memory has already been released, for example the memory for intermediate activations, input ids, etc. So could this calculation produce a higher block count than is actually available? Thanks.

vincent-pli · Oct 13 '25 02:10

I don't think so. To answer your question, we first need to know how PyTorch manages GPU memory. As far as I know, PyTorch has two allocation modes: either PyTorch allocates and manages a memory pool itself (the "caching" allocator), or it relies on cudaMallocAsync and cudaFreeAsync. We focus on the former mode since it's PyTorch's default. In this mode, when a tensor is freed, PyTorch doesn't immediately call cudaFree (which would inform the CUDA driver that the memory is free); instead, it just marks the corresponding region as free internally. This allows fast memory deallocation without device synchronizations. You may find the relevant document here. The size of the memory pool grows monotonically unless the user explicitly calls torch.cuda.empty_cache(), meaning that its size records the peak memory usage.
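
A minimal sketch (assuming a CUDA-capable GPU and the default caching allocator) that illustrates this behavior: deleting a tensor frees it inside PyTorch's pool, but the memory is not returned to the driver until torch.cuda.empty_cache() is called.

```python
import torch

free0, total = torch.cuda.mem_get_info()          # driver-level view (cudaMemGetInfo)

x = torch.empty(1 << 28, dtype=torch.float32, device="cuda")   # ~1 GiB allocation
free1, _ = torch.cuda.mem_get_info()
print(f"after alloc, driver free memory dropped by {(free0 - free1) / 2**30:.2f} GiB")

del x                                             # freed inside PyTorch's pool only
free2, _ = torch.cuda.mem_get_info()
print(f"after del, driver free memory is still down {(free0 - free2) / 2**30:.2f} GiB")
print(f"pool size (memory_reserved): {torch.cuda.memory_reserved() / 2**30:.2f} GiB")

torch.cuda.empty_cache()                          # only now is cudaFree actually called
free3, _ = torch.cuda.mem_get_info()
print(f"after empty_cache, driver free memory is down only {(free0 - free3) / 2**30:.2f} GiB")
```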

Then let's look at the meaning of torch.cuda.mem_get_info(). According to its documentation, it returns the free and total GPU memory as reported by cudaMemGetInfo() from the CUDA driver. Since the caching allocator never returns memory to the driver on its own, total_memory - free_memory computed from these values includes the full size of PyTorch's memory pool.
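
For reference, a quick way to see what it reports:

```python
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()   # wraps cudaMemGetInfo
print(f"free:  {free_bytes  / 2**30:.2f} GiB")
print(f"total: {total_bytes / 2**30:.2f} GiB")
```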

Combining the two points above, we get: peak_memory = total_memory - free_memory.
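
Putting it together, here is a hypothetical sketch of such a peak-based block-count estimate; the function and parameter names are illustrative and not taken from swiftLLM's actual code:

```python
import torch

def estimate_num_gpu_blocks(block_size_bytes: int, gpu_mem_utilization: float = 0.9) -> int:
    # Called right after the profiling forward pass, so the pool size reflects peak usage.
    free_memory, total_memory = torch.cuda.mem_get_info()
    peak_memory = total_memory - free_memory
    usable_memory = total_memory * gpu_mem_utilization - peak_memory
    return max(int(usable_memory) // block_size_bytes, 0)
```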

interestingLSY · Oct 13 '25 03:10

An example:

(screenshot attached)

interestingLSY · Oct 13 '25 05:10

Wow, thanks for the clarification, very useful. I will keep the issue open for people who have the same confusion.

vincent-pli · Oct 14 '25 02:10