Fast-LLM
[feat] Support sparse copy for a large number of experts
🐞 Describe the Bug
Facing an OutOfResources error with 64 fine-grained experts and dropless MoE enabled, even though there is sufficient GPU memory.
🔄 Steps to Reproduce
Steps to reproduce the behavior:
- Fast-LLM Docker image tag: `sha-8f06975`
- Training config:
```yaml
model:
  base_model:
    transformer:
      normalization:
        type: rms_norm
      num_layers: 16
      hidden_size: 2048
      num_attention_heads: 16
      head_groups: 1
      add_linear_biases: false
      ffn_hidden_size: 1024
      kv_channels: 128
      use_rotary_embeddings: true
      gated: true
      activation_type: silu
      num_experts: 64
      num_experts_per_token: 8
      init_method_std: 0.02
      init_method_std_qkv: 0.02
      init_method_std_attn_proj: 0.0035355339059327372
      init_method_std_mlp_1: 0.02
      init_method_std_mlp_2: 0.0035355339059327372
      expert_z_loss_coefficient: 0.001
      mlp_lr_scale:
        - null
    vocab_size: 131072
    use_position_embeddings: false
    tie_word_embeddings: false
    init_method_std_embed: 0.02
    cross_entropy_impl: fused
  multi_stage:
    zero_stage: 1
  distributed:
    world_size: 16
    rank: 0
    local_world_size: 8
    distributed_timeout: 3600.0
    seed: 984059
    training_dtype: bfloat16
pretrained:
  format: fast_llm
  path: null
  load_config: architecture
training:
  validation:
    interval: 1000
    iterations: 25
  logs:
    interval: 10
  export:
    format: fast_llm
    optimizer_state: true
  wandb:
    group_name: pilot
    project_name: olmoe-1b-7b
    entity_name: tscholak
  train_iters: 1000
  num_workers: 8
batch:
  micro_batch_size: 2
  depth_first_micro_batches: 1
  breadth_first_micro_batches: 1
  sequential_micro_batches: 1
  batch_size: 32
  sequence_length: 2048
  micro_sequence_length: 2048
data:
  tokenizer:
    path: null
  fim:
    rate: 0.0
  split:
    - 998.0
    - 2.0
    - 0.0
  format: file
  path:
    - /mnt/workspace/inputs/openwebtext-SmolLM2/fast_llm_dataset.json
profiling:
  cuda: false
optimizer:
  learning_rate:
    base: 0.0001
  weight_decay: 0.1
  beta_2: 0.95
```
- Error log:

```
  File "/app/fast_llm/layers/transformer/mixture_of_experts.py", line 114, in forward
    return self._mlp_forward(hidden_states, scores, top_experts).view_as(input_), None # noqa
  File "/app/fast_llm/layers/transformer/mixture_of_experts.py", line 118, in _forward_dropless
    sparse_map = get_sparse_map(top_experts, self._num_experts, dynamic_shape=self._dynamic_shape)
  File "/app/fast_llm/functional/triton/sparse_copy.py", line 301, in get_sparse_map
    sparse_map_kernel[(triton.cdiv(num_rows_dense, block_size),)](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 180, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 412, in run
    kernel.run(grid_0, grid_1, grid_2, metadata.num_warps,
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 339, in __getattribute__
    self._init_handles()
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 332, in _init_handles
    raise OutOfResources(self.metadata.shared, max_shared, "shared memory")
triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 262144, Hardware limit: 232448. Reducing block sizes or `num_stages` may help.
```
🎯 Expected Behavior
Training should run with this configuration without hitting the shared-memory limit.
📝 Additional Context
- Works with micro-batch size 1 and micro-sequence length 2048. Increasing the micro-batch size to 2 or the micro-sequence length to 4096 triggers the error (see the back-of-the-envelope below).
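A rough back-of-the-envelope, assuming the sparse-map kernel stages one shared-memory entry per (token, expert) assignment: 2 micro-batch sequences × 2048 tokens × 8 experts per token = 32,768 assignments. At 8 bytes per entry that is exactly the 262,144 bytes the error reports, above the 232,448-byte hardware limit, whereas micro-batch size 1 gives 131,072 bytes and fits; the same over-limit count is reached with micro-sequence length 4096, which matches the observation above. For reference, the batch settings reported to work:

```yaml
batch:
  micro_batch_size: 1          # 2 triggers the shared-memory error with 64 experts
  batch_size: 32
  sequence_length: 2048
  micro_sequence_length: 2048  # 4096 also triggers the error
```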
This is a Triton resource error; our dropless-MLP implementation may not be able to handle this many experts. Fixing it will require an in-depth investigation and some implementation work. In the meantime, the model should be runnable by disabling dropless MoE (see the sketch below).
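As a stopgap, something along these lines should switch the MoE layer back to the non-dropless path. The flag name `dropless_moe` is an assumption on my part and should be checked against the transformer config schema before use:

```yaml
model:
  base_model:
    transformer:
      # Assumed flag name; verify against the Fast-LLM transformer config.
      dropless_moe: false
```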