[BUG] VRAM usage on cuda:0 vs 4.2.5

Open avtc opened this issue 6 months ago • 40 comments

Trying to quantize with gptqmodel commit hash d8f3c78988bb8f11982a5e52361537ffba05d145 with mock_quantization=False, and got an error on the first layer with experts (layer 1) of GLM-4.5-Air:

Quantizing mlp.experts.32.gate_proj in layer  [1 of 45] ████-------------------------------------------------------------------------------------------------| 0:13:41 / 5:14:43 [2/46] 4.3%Traceback (most recent call last):
  File "/home/ubuntu/Documents/Quantize/quantize-glm4.5-Air-gptqmodel-moe-prune-smart-4.py", line 489, in <module>
    model.quantize(
    ~~~~~~~~~~~~~~^
        calibration_dataset,
        ^^^^^^^^^^^^^^^^^^^^
        batch_size=BATCH_SIZE,
        ^^^^^^^^^^^^^^^^^^^^^^
        )
        ^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/models/base.py", line 717, in quantize
    return module_looper.loop(
           ~~~~~~~~~~~~~~~~~~^
        backend=backend,
        ^^^^^^^^^^^^^^^^
        fail_safe=self.quantize_config.fail_safe,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 850, in loop
    name, m = fut.result()
              ~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/utils/threadx.py", line 360, in _run
    result = fn(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 842, in _process_on_worker
    proc.process(module=nm)
    ~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/gptq_processor.py", line 123, in process
    wq, q_scales, q_zeros, q_g_idx, duration, avg_loss, damp_percent, nsamples = g.quantize()
                                                                                 ~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/quantization/gptq.py", line 354, in quantize
    Hinv, damp = self.hessian_inverse(self.H)
                 ~~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/quantization/gptq.py", line 257, in hessian_inverse
    H2 = torch.linalg.cholesky(H2)
RuntimeError: cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR, when calling `cusolverDnCreate(handle)`. If you keep seeing this error, you may use `torch.backends.cuda.preferred_linalg_library()` to try linear algebra operators with other supported backends. See https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.preferred_linalg_library
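As the error message suggests, one possible (untested here) workaround is to steer torch away from cuSOLVER before quantizing, e.g. by switching the preferred linear-algebra backend to MAGMA:

```python
import torch

# Workaround hinted at by the error message above: route torch.linalg ops
# through MAGMA instead of cuSOLVER. "magma" must be available in your torch
# build; "cusolver" and "default" are the other accepted values.
torch.backends.cuda.preferred_linalg_library("magma")
```

This is a config tweak, not a root-cause fix; the default backend is usually faster when it works.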

avtc avatar Oct 03 '25 20:10 avtc

Another threading bug. This one is wild. It just suddenly stopped working. Looks like linalg ops are flaky under threading.

Qubitium avatar Oct 04 '25 00:10 Qubitium

@avtc Both issues should be fixed for good. Please try it now. I will reopen if you still get errors.

Qubitium avatar Oct 04 '25 01:10 Qubitium

INFO  ModuleLooper: forward start (processor=`gptq`, layer=`model.layers.1`, subset=3/7, batches=1057)                     %
Quantizing mlp.experts.32.gate_proj in layer  [1 of 45] █------------------------------------| 0:12:39 / 4:50:57 [2/46] 4.3%Traceback (most recent call last):
  File "/home/ubuntu/Documents/Quantize/quantize-glm4.5-Air-gptqmodel-moe-prune-smart-4.py", line 489, in <module>
    model.quantize(
    ~~~~~~~~~~~~~~^
        calibration_dataset,
        ^^^^^^^^^^^^^^^^^^^^
        batch_size=BATCH_SIZE,
        ^^^^^^^^^^^^^^^^^^^^^^
        )
        ^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/models/base.py", line 946, in quantize
    return module_looper.loop(
           ~~~~~~~~~~~~~~~~~~^
        backend=backend,
        ^^^^^^^^^^^^^^^^
        fail_safe=self.quantize_config.fail_safe,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 860, in loop
    name, m = fut.result()
              ~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/utils/threadx.py", line 367, in _run
    result = fn(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 852, in _process_on_worker
    proc.process(module=nm)
    ~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/gptq_processor.py", line 123, in process
    wq, q_scales, q_zeros, q_g_idx, duration, avg_loss, damp_percent, nsamples = g.quantize()
                                                                                 ~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/quantization/gptq.py", line 354, in quantize
    Hinv, damp = self.hessian_inverse(self.H)
                 ~~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/quantization/gptq.py", line 250, in hessian_inverse
    H2 = TORCH_LINALG.cholesky(H2)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/utils/safe.py", line 47, in locked
    return attr(*args, **kwargs)
RuntimeError: cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR, when calling `cusolverDnCreate(handle)`. If you keep seeing this error, you may use `torch.backends.cuda.preferred_linalg_library()` to try linear algebra operators with other supported backends. See https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.preferred_linalg_library
terminate called without an active exception
Aborted (core dumped)

Could this be a memory issue? With 10 samples instead of 1084 it does not throw.

The dataset to repro with GLM-4.5-Air:

import random

from datasets import load_dataset

# Set seed for reproducibility
random.seed(42)

# 1. General Language
c4 = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).shuffle(seed=42).select(range(300)) # 300

# 2. Reasoning
gsm8k = load_dataset("gsm8k", "main", split="train").shuffle(seed=42).select(range(300)) # 300
arc = load_dataset("ai2_arc", "ARC-Challenge", split="train").shuffle(seed=42).select(range(300)) # 300

# 3. Technical/Development
humaneval = load_dataset("openai_humaneval", split="test").shuffle(seed=42).select(range(164)) # 164

# 4. Instruction-Following
alpaca = load_dataset("tatsu-lab/alpaca", split="train").shuffle(seed=42).select(range(20)) # 20

# Process each dataset and extract text
calibration_texts = []

# Process C4 dataset
for item in c4:
    calibration_texts.append(item["text"])

# Process GSM8K dataset
for item in gsm8k:
    calibration_texts.append(f"Question: {item['question']}\nAnswer: {item['answer']}")

# Process ARC dataset
for item in arc:
    calibration_texts.append(f"{item['question']}")

# Process HumanEval dataset
for item in humaneval:
    calibration_texts.append(item["prompt"])

# Process Alpaca dataset
for item in alpaca:
    input_text = f"\nInput: {item['input']}" if item.get("input") else ""
    calibration_texts.append(f"Instruction: {item['instruction']}{input_text}\nOutput: {item['output']}")

# Final shuffle to mix domains
random.shuffle(calibration_texts)

# Verify length
print(f"Total samples: {len(calibration_texts)}")

# Use with GPTQModel
calibration_dataset = calibration_texts  # This is your final calibration dataset

avtc avatar Oct 04 '25 05:10 avtc

@avtc What! This looks to be a torch internal bug interfacing with low-level cusolver. I got the same crash as yours and solved it with the PR fixes. Let me triple-check your stacktrace to see how the heck it is still happening. There is a gigantic global lock on the linalg ops now, so it is not possible for two threads to execute anything torch.linalg related.

Can you send me your full running quant script and I will replicate it on my end.

I don't think it has any direct relationship with your calibration dataset size or the memory usage of the ops. The dataset size only changes the timing of the calls.

Qubitium avatar Oct 04 '25 06:10 Qubitium

@Qubitium The full script. quantize-glm4.5-Air-gptqmodel-moe-prune-smart-4.py

I have lowered the number of samples in:

c4 = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).shuffle(seed=42).select(range(100)) # 300

and got another error that looks memory-related:

Quantizing mlp.experts.33.gate_proj in layer  [1 of 45] █------------------------------------| 0:11:43 / 4:29:29 [2/46] 4.3%Traceback (most recent call last):
  File "/home/ubuntu/Documents/Quantize/quantize-glm4.5-Air-gptqmodel-moe-prune-smart-4.py", line 489, in <module>
    model.quantize(
    ~~~~~~~~~~~~~~^
        calibration_dataset,
        ^^^^^^^^^^^^^^^^^^^^
        batch_size=BATCH_SIZE,
        ^^^^^^^^^^^^^^^^^^^^^^
        )
        ^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/models/base.py", line 946, in quantize
    return module_looper.loop(
           ~~~~~~~~~~~~~~~~~~^
        backend=backend,
        ^^^^^^^^^^^^^^^^
        fail_safe=self.quantize_config.fail_safe,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 860, in loop
    name, m = fut.result()
              ~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/utils/threadx.py", line 367, in _run
    result = fn(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 852, in _process_on_worker
    proc.process(module=nm)
    ~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/gptq_processor.py", line 123, in process
    wq, q_scales, q_zeros, q_g_idx, duration, avg_loss, damp_percent, nsamples = g.quantize()
                                                                                 ~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/quantization/gptq.py", line 494, in quantize
    W1[:, i:] -= err1.unsqueeze(1).matmul(Hinv1[i, i:].unsqueeze(0))
                 ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
terminate called without an active exception
Aborted (core dumped)

avtc avatar Oct 04 '25 06:10 avtc

Is there a switch to turn off data parallelism so I can check without it?

avtc avatar Oct 04 '25 06:10 avtc

After lowering the sample count further it passes layer 1; will check how it goes.

# 1. General Language (40% - 410 samples)
c4 = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).shuffle(seed=42).select(range(100)) # 300

# 2. Reasoning (30% - 307 samples)
gsm8k = load_dataset("gsm8k", "main", split="train").shuffle(seed=42).select(range(100)) # 300
arc = load_dataset("ai2_arc", "ARC-Challenge", split="train").shuffle(seed=42).select(range(300)) # 300

# 3. Technical/Development (20% - 205 samples)
humaneval = load_dataset("openai_humaneval", split="test").shuffle(seed=42).select(range(164)) # 164

# 4. Instruction-Following (10% - 102 samples)
alpaca = load_dataset("tatsu-lab/alpaca", split="train").shuffle(seed=42).select(range(20)) # 20

avtc avatar Oct 04 '25 06:10 avtc

Can you make it crash again and show me your gptqmodel quant logs on the CLI before the crash? I need to see your GPU memory usage per device, which is now printed per completed quant module.

It could be memory pressure.

Qubitium avatar Oct 04 '25 07:10 Qubitium

@Qubitium I set the dataset size back and the issue reproduced. It does not log the experts before the exception.

INFO  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
INFO  | process     | layer     | module               | loss             | samples     | damp        | time      | fwd_time     | (v)ram                                                                                                              | dynamic     |
INFO  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
INFO  | gptq        | 1         | self_attn.k_proj     | 0.0000091407 | 247943      | 0.05000     | 0.960     | 17.797       | cuda:0=14.4GB, cuda:1=9.0GB, cuda:2=6.3GB, cuda:3=7.4GB, cuda:4=6.9GB, cuda:5=8.8GB, cuda:6=7.0GB, cuda:7=7.5GB     | None        | 
INFO  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
INFO  | gptq        | 1         | self_attn.v_proj     | 0.0000003122 | 247943      | 0.05000     | 1.007     | 17.797       | cuda:0=14.4GB, cuda:1=9.0GB, cuda:2=6.3GB, cuda:3=7.4GB, cuda:4=6.9GB, cuda:5=8.8GB, cuda:6=7.0GB, cuda:7=7.5GB     | None        | 
INFO  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
INFO  | gptq        | 1         | self_attn.q_proj     | 0.0000121996 | 247943      | 0.05000     | 1.558     | 17.797       | cuda:0=14.4GB, cuda:1=9.0GB, cuda:2=6.3GB, cuda:3=7.4GB, cuda:4=6.9GB, cuda:5=8.8GB, cuda:6=7.0GB, cuda:7=7.5GB     | None        | 
INFO  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
INFO  ModuleLooper: forward start (processor=`gptq`, layer=`model.layers.1`, subset=2/7, batches=1057)                     %
INFO  GC completed in 13.993s (pass #1).                                                                                   %
INFO  | gptq        | 1         | self_attn.o_proj     | 0.0000000035 | 247943      | 0.05000     | 2.590     | 66.516       | cuda:0=5.9GB, cuda:1=3.5GB, cuda:2=6.7GB, cuda:3=3.5GB, cuda:4=4.1GB, cuda:5=4.1GB, cuda:6=3.5GB, cuda:7=4.1GB      | None        | 
INFO  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
INFO  ModuleLooper: forward start (processor=`gptq`, layer=`model.layers.1`, subset=3/7, batches=1057)                     %
Quantizing mlp.experts.32.gate_proj in layer  [1 of 45] █------------------------------------| 0:14:30 / 5:33:30 [2/46] 4.3%Traceback (most recent call last):
  File "/home/ubuntu/Documents/Quantize/quantize-glm4.5-Air-gptqmodel-moe-prune-smart-4.py", line 489, in <module>
    model.quantize(
    ~~~~~~~~~~~~~~^
        calibration_dataset,
        ^^^^^^^^^^^^^^^^^^^^
        batch_size=BATCH_SIZE,
        ^^^^^^^^^^^^^^^^^^^^^^
        )
        ^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/models/base.py", line 946, in quantize
    return module_looper.loop(
           ~~~~~~~~~~~~~~~~~~^
        backend=backend,
        ^^^^^^^^^^^^^^^^
        fail_safe=self.quantize_config.fail_safe,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 860, in loop
    name, m = fut.result()
              ~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/utils/threadx.py", line 367, in _run
    result = fn(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 852, in _process_on_worker
    proc.process(module=nm)
    ~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/gptq_processor.py", line 123, in process
    wq, q_scales, q_zeros, q_g_idx, duration, avg_loss, damp_percent, nsamples = g.quantize()
                                                                                 ~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/quantization/gptq.py", line 354, in quantize
    Hinv, damp = self.hessian_inverse(self.H)
                 ~~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/quantization/gptq.py", line 250, in hessian_inverse
    H2 = TORCH_LINALG.cholesky(H2)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/utils/safe.py", line 47, in locked
    return attr(*args, **kwargs)
RuntimeError: cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR, when calling `cusolverDnCreate(handle)`. If you keep seeing this error, you may use `torch.backends.cuda.preferred_linalg_library()` to try linear algebra operators with other supported backends. See https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.preferred_linalg_library
terminate called without an active exception
terminate called recursively
terminate called recursively
Aborted (core dumped)

gptq_log_infatuation_time_10_04_2025_10h_30m_21s.log

avtc avatar Oct 04 '25 08:10 avtc

The issue can be closed as related to low VRAM or a large dataset. (BTW, the same dataset worked for me before data parallel.)

avtc avatar Oct 04 '25 11:10 avtc

I still consider this a bug. Data parallel should not bloat VRAM so much that it causes OOM. Please check if the latest main solves this. Barriers are now set up at every forward point to make sure all background threads are done before proceeding.
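The per-forward barrier idea can be sketched with `threading.Barrier` (a minimal illustration of the synchronization pattern, not the actual GPTQModel code; `NUM_WORKERS` and `worker` are hypothetical names):

```python
import threading

# All worker threads must reach the barrier before any of them proceeds,
# so no background work from one phase can overlap the next forward pass.
NUM_WORKERS = 4
results = []
results_lock = threading.Lock()
barrier = threading.Barrier(NUM_WORKERS)

def worker(idx):
    # ... per-device quantization work would happen here ...
    with results_lock:
        results.append(idx)
    barrier.wait()  # block until every worker has finished this phase

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 2, 3]
```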

Qubitium avatar Oct 04 '25 20:10 Qubitium

@Qubitium please push gptqmodel.utils.disk to main

avtc avatar Oct 06 '25 09:10 avtc

Fixed

Qubitium avatar Oct 06 '25 13:10 Qubitium

@Qubitium I have tried the latest main with "cuda:per": 1 and got a CUDA OOM. I am testing converting GLM-4.5-Air to 8-bit with the same dataset of 1084 samples that worked before data parallel and offload. The error happens after the first layer with experts finishes, while processing the second layer with experts.

The trace:

INFO  | gptq    | 1     | mlp.experts.123.down_proj | 0.0000000031 | 17310   | 0.05000 | 0.333 | 82.792   | cuda:0=18.1GB, cuda:1=9.5GB, cuda:2=7.5GB, cuda:3=9.0GB, cuda:4=23.0GB, cuda:5=7.6GB, cuda:6=6.6GB, cuda:7=23.4GB      |         |
INFO  +---------+-------+---------------------------+--------------+---------+---------+-------+----------+------------------------------------------------------------------------------------------------------------------------+---------+
INFO  gc.collect(2) reclaimed 4182 objects in 0.231s                                                           %
Loading checkpoint shards: 100%|████████████████████████████████████████████████| 47/47 [00:03<00:00, 12.42it/s]
...
INFO  gc.collect(2) reclaimed 2033 objects in 0.204s                                                           m
INFO  | process | layer | module                    | loss         | samples | damp    | time  | fwd_time | (v)ram                                                                                                                 | dynamic |
INFO  +---------+-------+---------------------------+--------------+---------+---------+-------+----------+------------------------------------------------------------------------------------------------------------------------+---------+
INFO  | gptq    | 2     | self_attn.k_proj          | 0.0000000443 | 247943  | 0.05000 | 0.806 | 33.977   | cuda:0=8.9GB, cuda:1=15.9GB, cuda:2=16.3GB, cuda:3=18.6GB, cuda:4=10.2GB, cuda:5=14.8GB, cuda:6=14.8GB, cuda:7=10.4GB  |         |
INFO  +---------+-------+---------------------------+--------------+---------+---------+-------+----------+------------------------------------------------------------------------------------------------------------------------+---------+
INFO  | gptq    | 2     | self_attn.v_proj          | 0.0000000054 | 247943  | 0.05000 | 0.839 | 33.977   | cuda:0=8.9GB, cuda:1=15.9GB, cuda:2=16.3GB, cuda:3=18.6GB, cuda:4=10.2GB, cuda:5=14.8GB, cuda:6=14.8GB, cuda:7=10.4GB  |         |
INFO  +---------+-------+---------------------------+--------------+---------+---------+-------+----------+------------------------------------------------------------------------------------------------------------------------+---------+
INFO  | gptq    | 2     | self_attn.q_proj          | 0.0000001415 | 247943  | 0.05000 | 1.256 | 33.977   | cuda:0=8.9GB, cuda:1=15.9GB, cuda:2=16.3GB, cuda:3=18.6GB, cuda:4=10.2GB, cuda:5=14.8GB, cuda:6=14.8GB, cuda:7=10.4GB  |         |
INFO  +---------+-------+---------------------------+--------------+---------+---------+-------+----------+------------------------------------------------------------------------------------------------------------------------+---------+
INFO  gc.collect(2) reclaimed 10677 objects in 0.222s                                                          .
INFO  | gptq    | 2     | self_attn.o_proj          | 0.0000000000 | 247943  | 0.05000 | 3.584 | 70.584   | cuda:0=11.1GB, cuda:1=22.3GB, cuda:2=21.5GB, cuda:3=22.7GB, cuda:4=10.2GB, cuda:5=14.8GB, cuda:6=21.9GB, cuda:7=15.3GB |         |
INFO  +---------+-------+---------------------------+--------------+---------+---------+-------+----------+------------------------------------------------------------------------------------------------------------------------+---------+
INFO  gc.collect(2) reclaimed 2403 objects in 0.203s                                                           %
Forward start (layer=`model.layers.2`, subset=3/7, batches=1057) [2 of 45] -----| 0:44:21 / 11:20:02 [3/46] 6.5%Traceback (most recent call last):
  File "/home/ubuntu/Documents/Quantize/quantize-glm4.5-Air-gptqmodel-moe-prune-smart-4.py", line 495, in <module>
    model.quantize(
    ~~~~~~~~~~~~~~^
        calibration_dataset,
        ^^^^^^^^^^^^^^^^^^^^
        batch_size=BATCH_SIZE,
        ^^^^^^^^^^^^^^^^^^^^^^
        )
        ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/models/base.py", line 940, in quantize
    return module_looper.loop(
           ~~~~~~~~~~~~~~~~~~^
        backend=backend,
        ^^^^^^^^^^^^^^^^
        fail_safe=self.quantize_config.fail_safe,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 792, in loop
    forward_outputs = self._run_forward_batches(
        module=module,
    ...<10 lines>...
        reuse_kv=reuse_kv,
    )
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 244, in _run_forward_batches
    return self._run_forward_batches_parallel(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        module=module,
        ^^^^^^^^^^^^^^
    ...<11 lines>...
        devices=devices,
        ^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 394, in _run_forward_batches_parallel
    batch_idx, module_output, kv_next = fut.result()
                                        ~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/utils/threadx.py", line 377, in _run
    result = fn(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/utils/looper_helpers.py", line 348, in forward_batch_worker
    module_output = module(*inputs, **additional_inputs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 395, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 345, in forward
    hidden_states = self.moe(hidden_states, topk_indices, topk_weights).view(*orig_shape)
                    ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 331, in moe
    expert_output = expert(expert_input)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 223, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
                                           ~~~~~~~~~~~~~~^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/nn_modules/hooked_linear.py", line 221, in forward
    self.forward_hook(self, (input,), output)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 456, in hook
    return inner_hook(module, new_inputs, new_output)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/gptq_processor.py", line 108, in tmp
    g.add_batch(inp[0].data, out.data)  # noqa: F821
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/quantization/gptq.py", line 136, in add_batch
    self.process_batch(inp)
    ~~~~~~~~~~~~~~~~~~^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/quantization/gptq.py", line 188, in process_batch
    self.H = self.H.to(device=reshaped_inp.device)
             ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity of 23.58 GiB of which 7.50 MiB is free. Including non-PyTorch memory, this process has 23.55 GiB memory in use. Of the allocated memory 22.76 GiB is allocated by PyTorch, and 332.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
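One thing that may be worth trying, per the allocator hint in the OOM message above (an environment tweak, not a guaranteed fix):

```shell
# Allow the CUDA caching allocator to grow segments instead of fragmenting;
# this often helps when "reserved but unallocated" memory is large.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python quantize-glm4.5-Air-gptqmodel-moe-prune-smart-4.py
```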

The quant log attached. Will try with lower samples amount.

gptq_log_womanizers_time_10_06_2025_15h_37m_30s.log

avtc avatar Oct 06 '25 13:10 avtc

@Qubitium tried with offload_to_disk=False:

Traceback (most recent call last):
  File "/home/ubuntu/Documents/Quantize/quantize-glm4.5-Air-gptqmodel-moe-prune-smart-4.py", line 495, in <module>
    model.quantize(
    ~~~~~~~~~~~~~~^
        calibration_dataset,
        ^^^^^^^^^^^^^^^^^^^^
        batch_size=BATCH_SIZE,
        ^^^^^^^^^^^^^^^^^^^^^^
        )
        ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/models/base.py", line 940, in quantize
    return module_looper.loop(
           ~~~~~~~~~~~~~~~~~~^
        backend=backend,
        ^^^^^^^^^^^^^^^^
        fail_safe=self.quantize_config.fail_safe,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 628, in loop
    input_cache = self.cache_inputs(layers=layers,
                                    calibration_data=processor.calibration_dataset,
                                    use_cache=False)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 537, in cache_inputs
    self.gptq_model.reload_turtle_model()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/models/base.py", line 1326, in reload_turtle_model
    assert turtle_model is not None and  model_local_path is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

avtc avatar Oct 06 '25 15:10 avtc

@avtc My bad. Never tested with offload off. Oops. Will fix this.

Right now my focus is to fix a very nasty segfault that is semi random no matter what tricks i tried.

Qubitium avatar Oct 06 '25 16:10 Qubitium

After using 100 samples less from c4 (so 984 in total), the CUDA OOM happens a layer later.

INFO  gc.collect(2) reclaimed 10515 objects in 0.218s
INFO  | gptq | 3 | self_attn.o_proj | 0.0000000001 | 181592 | 0.05000 | 2.454 | 57.949 | cuda:0=9.3GB, cuda:1=11.1GB, cuda:2=11.2GB, cuda:3=10.9GB, cuda:4=10.8GB, cuda:5=11.0GB, cuda:6=11.2GB, cuda:7=11.7GB | |
INFO  +---------+-------+---------------------------+--------------+---------+---------+-------+----------+------------------------------------------------------------------------------------------------------------------------+---------+
INFO  gc.collect(2) reclaimed 539 objects in 0.215s
Forward start (layer=model.layers.3, subset=3/7, batches=957) [3 of 45] ------| 1:03:11 / 12:06:36 [4/46] 8.7%

avtc avatar Oct 06 '25 17:10 avtc

After dropping another 100 samples from c4 (884 in total), the CUDA OOM again happens one layer later.

INFO  | gptq    | 4     | self_attn.o_proj          | 0.0000000002 | 129228  | 0.05000 | 2.916 | 66.576   | cuda:0=8.6GB, cuda:1=21.5GB, cuda:2=19.1GB, cuda:3=6.4GB, cuda:4=23.6GB, cuda:5=19.7GB, cuda:6=13.3GB, cuda:7=14.9GB   |         |
INFO  +---------+-------+---------------------------+--------------+---------+---------+-------+----------+------------------------------------------------------------------------------------------------------------------------+---------+
INFO  gc.collect(2) reclaimed 3322 objects in 0.282s                                                           %
Forward start (layer=`model.layers.4`, subset=3/7, batches=857) [4 of 45] -----| 1:27:06 / 13:21:19 [5/46] 10.9%

Could be a VRAM leak from layer to layer, not sure.
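One way to check the layer-to-layer leak hypothesis (a generic sketch, not part of GPTQModel) is to snapshot per-device allocations at each layer boundary and diff them:

```python
import torch

def vram_snapshot() -> dict:
    """Allocated bytes per visible CUDA device (empty on CPU-only hosts)."""
    if not torch.cuda.is_available():
        return {}
    return {f"cuda:{i}": torch.cuda.memory_allocated(i)
            for i in range(torch.cuda.device_count())}

def vram_delta(before: dict, after: dict) -> dict:
    """Per-device growth in MiB between two snapshots; a steadily
    positive delta across layers points at references held across layers."""
    return {dev: (after.get(dev, 0) - before.get(dev, 0)) / (1 << 20)
            for dev in after}
```

Logging `vram_delta(vram_snapshot_before, vram_snapshot_after)` once per layer would show immediately whether any device grows monotonically.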

avtc avatar Oct 06 '25 19:10 avtc

@avtc Pull main. I added requirements.txt back so you can easily install updates that are required by new pulls. you need to pull since logbar got updated.

Core logging got a major update: forward (pre-quant) and forward replay (post-quant) now show as separate progress bars, so we can see actual progress and state instead of it appearing stuck for large MoE models.

Image

This is my test on test_qwen3_moe.py script. Each layer has 389 modules and gpu vram is very stable between layers.

Can you do a screen capture too? I want to see if GLM 4.5 Air behaves differently or not. I don't understand why GLM 4.5 is blowing up with OOM on forwarding when the log you provided shows only about 11GB of VRAM in use before forwarding starts, yet it burns through the remaining 13GB (24GB - 11GB). Is GLM 4.5 Air an even wider MoE than Qwen3 MoE? Qwen3 MoE with 1024 samples at batch_size=4 only uses about 9GB of VRAM during quant.

Submodule finalizing (after the entire layer is quantized) is also its own progress bar now. The logbar update allowed me to stack unlimited progress bars.

Qubitium avatar Oct 06 '25 19:10 Qubitium

@avtc btw, also post what Linux OS, kernel version, and CPU you are using. I want to think about why our VRAM gc strategies behave so differently.

Qubitium avatar Oct 06 '25 19:10 Qubitium

Will check tomorrow

avtc avatar Oct 06 '25 19:10 avtc

> After using 100 samples less from c4 (so 984 in total), the CUDA OOM happens a layer later. […]

  1. After moving DEVICE_THREAD_POOL.wait() to execute after each layer, I was able to proceed to layer 4. During this test cuda:per was set to 4.

  2. Then I tried lowering empty_cache_every_n from 1024 to 10 and got an error at the start of quantization:
Exception in thread DP-Janitor:
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/threading.py", line 1043, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/threading.py", line 994, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/utils/threadx.py", line 1355, in _janitor_loop
    use_fn()
    ~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/cuda/memory.py", line 224, in empty_cache
    torch._C._cuda_emptyCache()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
torch.AcceleratorError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

But with empty_cache_every_n=128 it proceeded.

  3. I have very long Forward start / Forward rows for self_attn.o_proj on every layer [with experts], so I wanted to try with offload off to see if it makes it faster:
Image Image Image

Here is finalization screenshot of layer 3:

Image

Wondering how to make it faster, and why it became so slow.

  4. My setup:
     - OS: Kubuntu 24.04.2 LTS
     - kernel: 6.11.0-29-generic (had to pause kernel updates to avoid reinstalling the nvidia driver on each update)
     - CPU: EPYC 9124
     - MB: ASUS K14PA-U12
     - RAM: 64GB DDR5 4800
     - GPUs: 8x3090, 1 in PCIe x16, the others in MCIO x8 ports via C-Payne MCIO-to-PCIe gen5 adapters
     - NVMe: 990 Pro 4TB
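The empty_cache_every_n throttle amounts to a simple counter; a minimal sketch (names are mine, not GPTQModel's) shows why 128 is a reasonable middle ground between flushing too rarely and too often:

```python
import torch

class PeriodicCacheFlusher:
    """Call tick() once per batch; flushes the CUDA caching allocator
    every `every_n` ticks. empty_cache() is relatively expensive (it
    returns cached blocks to the driver), so a very small interval
    (e.g. 10) pays that cost constantly, while a very large one
    (e.g. 1024) lets the cache and fragmentation build up."""
    def __init__(self, every_n: int = 128):
        self.every_n = every_n
        self.count = 0

    def tick(self) -> bool:
        self.count += 1
        due = (self.count % self.every_n == 0)
        if due and torch.cuda.is_available():
            torch.cuda.empty_cache()
        return due
```

Whether the `unspecified launch failure` at every_n=10 was actually caused by the flush frequency or merely surfaced by it is unclear from the log.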

avtc avatar Oct 07 '25 09:10 avtc

@Qubitium

3. I have very long Forward start / Forward rows for self_attn.o_proj on every layer [with experts], so I wanted to try with offload off to see if it makes it faster:

I have fixed turning off the offload, and the result is the same: it still processes very slowly and VRAM usage is close to 24GB on each card.

avtc avatar Oct 07 '25 13:10 avtc

@avtc Pull main. The segfault is fixed, and layer finalization (very slow for MoE due to so many modules) is now concurrent with the next layer's quantization. As you have proven, the OOM issue is unrelated to offload, which only deals with CPU memory.

The logs/progress bars are also more active and accurate at each stage.

Qubitium avatar Oct 07 '25 13:10 Qubitium

@avtc On test_qwen3_moe.py (my go-to small MoE test), where each layer has 380+ modules, my A100 test GPU only uses 12GB of VRAM with 2048 samples at batch_size=8. test_qwen3_next, with 1700+ MoE modules per layer, uses 22GB with 256 rows of calibration data at batch_size=1.

Something is terribly wrong with GLM 4.5 inference that is blowing up memory. Based on Qwen3-MoE (similar size to GLM 4.5 Air), it should be well under 20GB of VRAM usage with no OOM. Even on Qwen3-Next, whose MoE is 3x the size of GLM 4.5 Air's, I am not hitting 24GB.

Qubitium avatar Oct 07 '25 13:10 Qubitium

@Qubitium GLM-4.5-Air is a 106B model; it is larger than Qwen3-Next. There is no OOM after adding DEVICE_THREAD_POOL.wait() after each layer.

But the VRAM usage and time (probably related to the VRAM used) for the self_attn.o_proj forward pass are too high.
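The wait-after-each-layer fix is essentially a barrier between a layer's async module tasks and the next layer's forward pass. A minimal stdlib sketch of the idea (DEVICE_THREAD_POOL's real API differs):

```python
from concurrent.futures import ThreadPoolExecutor, wait

def quantize_layer(pool: ThreadPoolExecutor, layer_id: int, modules, quantize_fn):
    """Submit one task per module, then block until all finish so the
    next layer's forward (which also needs VRAM for activations) never
    overlaps with this layer's still-running quantization tasks."""
    futures = [pool.submit(quantize_fn, layer_id, m) for m in modules]
    wait(futures)  # the barrier: peak VRAM is capped at one layer's worth
    return [f.result() for f in futures]
```

The trade-off is throughput: the concurrent finalization on main avoids this stall but lets two layers' working sets coexist in VRAM.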

avtc avatar Oct 07 '25 14:10 avtc

@Qubitium I see, the forward is for the expert layers, which is why it takes so much VRAM and time. (I have disabled quantization for self_attn, but the forward still takes very long before the expert layers.)

As far as I remember there was an option to enable or disable forward buffering. Not sure which mode remains after the switch was removed. But I used the disabled-forward-buffer option for a successful 8-bit quantization of GLM-4.5-Air with 1084 samples.

avtc avatar Oct 07 '25 16:10 avtc

I am now trying https://github.com/avtc/GPTQModel/commit/ff3777bcf7f064f0f591930b453a0054e6565477, based on checkpoint 30d87512b17f3cccb03fb8b431911b0d0f386466 (with offload, a few commits before data parallel) + ASYNC_WORKER.join() on layer finish + a lock around Q.to, to compare with, and it behaves much better than latest main with data parallel. I was able to start and run with 1084 samples (with self_attn modules additionally excluded) without CUDA OOM. The estimate for the full quantization is around 2 hours; the first 5 layers took ~17.5 minutes.

Image

I tried the same exclusion config with 220 samples on latest main, and it hit CUDA OOM on layer 5 and proceeded very slowly.

dynamic = {
    r"-:model.embed_tokens.weight": {},
    r"-:.*shared_experts": {},
    r"-:.*shared_head": {},
    r"-:lm_head.weight": {},
    r"-:.*mlp.down": {},
    r"-:.*mlp.gate": {}, # vllm does not support exclusion of only gate, need to exclude down & up as well
    r"-:.*mlp.up": {},
    r"-:.*post_attention_layernorm": {},
    r"-:.*self_attn": {},
    r"-:.*norm.weight": {},
    r"-:.*enorm": {},
    r"-:.*hnorm": {},
    r"-:.*eh_proj": {},
    r"-:.*input_layernorm": {},
    }
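For readers unfamiliar with the dynamic syntax: entries prefixed with `-:` exclude any module whose name matches the regex after the prefix. A loose stand-alone approximation of the matching (not GPTQModel's actual implementation; `rules` below is a small subset for illustration):

```python
import re

def is_excluded(module_name: str, dynamic_rules: dict) -> bool:
    """True if any `-:`-prefixed rule's regex matches the module name."""
    for pattern in dynamic_rules:
        if pattern.startswith("-:") and re.search(pattern[2:], module_name):
            return True
    return False

rules = {r"-:.*shared_experts": {}, r"-:.*self_attn": {}}
```

So `model.layers.0.self_attn.o_proj` is skipped while the routed-expert projections are still quantized.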

The script to repro: quantize-glm4.5-air-gptqmodel-2.py

Run in the gptqmodel venv:

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True,max_split_size_mb:128"
export PYTHON_GIL=0
python /home/ubuntu/Documents/Quantize/quantize-glm4.5-air-gptqmodel-2.py

avtc avatar Oct 07 '25 22:10 avtc

@avtc Please run tests/test_torch_replicate.py and test_p2p.py. The replicate test will check/benchmark the copy mechanism I used for data parallel, which would contribute to the slowness you observed if it degrades badly.

# a100 on amd 7343
tests/test_torch_replicate.py::test_torch_replicate_benchmark
strategy      time_avg_s    time_min_s    time_max_s    mem_avg_MB    mem_min_MB    mem_max_MB
----------  ------------  ------------  ------------  ------------  ------------  ------------
replicate         0.0018        0.0011        0.0022      128.0000      128.0000      128.0000
deepcopy          0.0033        0.0027        0.0039      384.0000      384.0000      384.0000
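The gap in the table is expected: replicate broadcasts parameters into shared flat buffers, while deepcopy allocates fresh storage for every tensor. A CPU-side illustration of the duplication cost (sizes are illustrative, not the benchmark's):

```python
import copy
import torch

def param_bytes(module: torch.nn.Module) -> int:
    """Total parameter storage of a module in bytes."""
    return sum(p.numel() * p.element_size() for p in module.parameters())

lin = torch.nn.Linear(1024, 1024)   # ~4 MiB of fp32 weight + bias
clone = copy.deepcopy(lin)          # independent storage: another ~4 MiB
# replicate-style copies broadcast into shared buffers instead of
# reallocating per replica, hence the ~3x memory gap in the table.
assert clone.weight.data_ptr() != lin.weight.data_ptr()
```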

Qubitium avatar Oct 08 '25 00:10 Qubitium

@avtc Running your script now. Btw, I like your TP padding code. I think this is a great feature to have for packing!

-> The padding conversion is super slow on my slow cpu/disk. Going to restart and skip this step.

Btw, I see that you power-limited your 3090s to 280 watts? I remember the 3090 is 350-375W. Is 280W a sweet spot for the 3090, where going higher probably doesn't squeeze much more out of it?

Qubitium avatar Oct 08 '25 00:10 Qubitium