Add Ascend NPU as a backend for single device recipes
Context
What is the purpose of this PR? Is it to
- [x] add a new feature
- [ ] fix a bug
- [ ] update tests and/or documentation
- [ ] other (please add here)
Changelog
What are the changes made in this PR?
- This PR adds Ascend NPU as a backend for eight single-device recipes: eleuther_eval, full_finetune_single_device, generate, dev/generate_v2, knowledge_distillation_single_device, lora_dpo_single_device, lora_finetune_single_device, and quantize.
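For reference, the core idea is that a recipe can request the NPU via its config (device: npu) and fall back to CPU when the backend is absent. Below is a minimal, hypothetical sketch of that fallback logic (not torchtune's actual device-resolution code; resolve_device here is an illustrative helper):

```python
def resolve_device(requested: str = "npu") -> str:
    # Hypothetical sketch: prefer the requested accelerator and fall
    # back to CPU when its backend is unavailable. Importing torch_npu
    # registers the "npu" device type with PyTorch.
    try:
        import torch
        import torch_npu  # noqa: F401  (registers the "npu" backend)
        if torch.npu.is_available():
            return requested
    except (ImportError, AttributeError):
        pass
    return "cpu"
```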
Environment
We conducted basic usage tests in the following environment.
- OS: Ubuntu 20.04
- NPU: Atlas 800T A2
- CANN: 8.0.RC3
- torch-npu: 2.5.1 rc1
- torch: 2.5.1
Recipe: eleuther_eval
- Model: Llama-3.2-1B-Instruct
- Config (only the main changes are listed)
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run eleuther_eval --config llama3_2/evaluation
INFO:torchtune.utils._logging:Running EleutherEvalRecipe with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: /tmp/Llama-3.2-1B-Instruct
recipe_checkpoint: null
device: npu
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
_component_: torchtune.models.llama3_2.llama3_2_1b
quantizer: null
resume_from_checkpoint: false
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Running evaluation on the following tasks: ['truthfulqa_mc2']
INFO:lm-eval:Building contexts for truthfulqa_mc2 on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:02<00:00, 399.47it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5882/5882 [00:44<00:00, 133.35it/s]
INFO:torchtune.utils._logging:Eval completed in 51.04 seconds.
INFO:torchtune.utils._logging:Max memory allocated: 12.21 GB
INFO:torchtune.utils._logging:
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |↑ |0.4393|± |0.0144|
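As a simplified illustration of how loglikelihood-based multiple-choice tasks are scored (truthfulqa_mc2 itself uses a probability-weighted variant, so this is only a sketch, not lm-eval's implementation):

```python
def mc_accuracy(option_loglikelihoods, gold_indices):
    # Simplified multiple-choice scoring: the predicted answer is the
    # option with the highest model log-likelihood; accuracy is the
    # fraction of questions where that matches the gold index.
    correct = 0
    for lls, gold in zip(option_loglikelihoods, gold_indices):
        pred = max(range(len(lls)), key=lambda i: lls[i])
        if pred == gold:
            correct += 1
    return correct / len(option_loglikelihoods)
```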
Recipe: full_finetune_single_device
- Model: Llama-3.2-1B-Instruct
- Config (only the main changes are listed)
optimizer:
_component_: torch.optim.AdamW # change the optimizer
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run full_finetune_single_device --config llama3_2/1B_full_single_device
INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:
batch_size: 4
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: /tmp/torchtune/llama3_2_1B/full_single_device
recipe_checkpoint: null
compile: false
dataset:
_component_: torchtune.datasets.alpaca_dataset
packed: false
source: /tmp/dataset/alpaca_data
device: npu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 100
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: /tmp/torchtune/llama3_2_1B/full_single_device/logs
model:
_component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
_component_: torch.optim.AdamW
lr: 2.0e-05
optimizer_in_bwd: true
output_dir: /tmp/torchtune/llama3_2_1B/full_single_device
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: /tmp/torchtune/llama3_2_1B/full_single_device/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 4147997196. Local seed is seed + rank = 4147997196 + 0
Writing logs to /tmp/torchtune/llama3_2_1B/full_single_device/logs/log_1736148110.txt
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Memory stats after model init:
NPU peak memory allocation: 3.31 GiB
NPU peak memory reserved: 3.32 GiB
NPU peak memory active: 3.31 GiB
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|100|Loss: 1.2042158842086792: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:27<00:00, 3.89it/s]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to /tmp/torchtune/llama3_2_1B/full_single_device/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|100|Loss: 1.2042158842086792: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:31<00:00, 3.17it/s]
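The config above sets optimizer_in_bwd: true, which applies each parameter's update during the backward pass instead of in a separate step(). A toy sketch of the idea (InBackwardSGD is an illustrative class, not torchtune's implementation):

```python
class InBackwardSGD:
    # Toy illustration of the optimizer_in_bwd idea: each parameter is
    # updated as soon as its gradient is produced during the backward
    # pass, so gradients never need to be stored until a later step().
    def __init__(self, lr=2e-5):
        self.lr = lr

    def on_grad_ready(self, param_value, grad):
        # Apply the update immediately; the gradient can then be freed,
        # reducing peak memory.
        return param_value - self.lr * grad
```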
Recipe: generate
- Model: Llama-3.2-1B-Instruct
- Config (only the main changes are listed)
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run generate --config generation
INFO:torchtune.utils._logging:Running InferenceRecipe with resolved config:
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: ./
device: npu
dtype: bf16
enable_kv_cache: true
max_new_tokens: 300
model:
_component_: torchtune.models.llama3_2.llama3_2_1b
output_dir: ./
prompt:
system: null
user: Tell me a joke.
quantizer: null
seed: 1234
temperature: 0.6
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
prompt_template: null
top_k: 300
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1234. Local seed is seed + rank = 1234 + 0
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tell me a joke.A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?"
The librarian replied, "It rings a bell, but I'm not sure if it's here or not."
INFO:torchtune.utils._logging:Time for inference: 3.94 sec total, 13.71 tokens/sec
INFO:torchtune.utils._logging:Bandwidth achieved: 34.48 GB/s
INFO:torchtune.utils._logging:Memory used: 3.56 GB
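The temperature: 0.6 and top_k: 300 settings above control sampling. A minimal sketch of what these knobs do (toy code with a small top_k default, not the recipe's actual sampler):

```python
import math
import random

def sample_next(logits, temperature=0.6, top_k=3, rng=random):
    # Keep only the top_k highest logits, soften them with the
    # temperature, then sample an index from the renormalized
    # distribution. Lower temperature -> greedier sampling.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```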
Recipe: dev/generate_v2
- Model: Llama-2-7b-hf
- Config (only the main changes are listed)
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run dev/generate_v2 --config llama2/generation_v2
INFO:torchtune.utils._logging:Running InferenceRecipe with resolved config:
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/llama2-7b-hf
checkpoint_files:
- pytorch_model-00001-of-00002.bin
- pytorch_model-00002-of-00002.bin
model_type: LLAMA2
output_dir: ./
device: npu
dtype: bf16
log_level: INFO
max_new_tokens: 200
model:
_component_: torchtune.models.llama2.llama2_7b
output_dir: ./
prompt:
system: You are a helpful and creative AI assistant.
user: What is the capital of France?
seed: 1234
temperature: 0.6
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
max_seq_len: 2048
path: /tmp/llama2-7b-hf/tokenizer.model
top_k: 300
/home/anaconda3/envs/torchtune_npu/lib/python3.10/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
INFO:torchtune.utils._logging:Model was initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:
</INST>
<</SYS>>
You are a helpful and creative AI assistant.
<</SYS>>
What is the capital of France? [/INST]
</INST>
<</SYS>>
You are a helpful and creative AI assistant.
<</SYS>>
What is the capital of France? [/INST]
</INST>
<</SYS>>
You are a helpful and creative AI assistant.
<</SYS>>
What is the capital of France? [/INST]
</INST>
<</SYS>>
You are a helpful and creative AI assistant.
<</SYS>>
What is the capital of France? [/INST]
</INST>
<</SYS>>
You are a helpful and creative AI assistant.
INFO:torchtune.utils._logging:Time for inference: 17.30 sec total, 11.62 tokens/sec
INFO:torchtune.utils._logging:Bandwidth achieved: 158.80 GB/s
INFO:torchtune.utils._logging:Max memory allocated: 13.95 GB
Recipe: knowledge_distillation_single_device
- Model: Qwen2-0.5B-Instruct, Qwen2-1.5B-Instruct
- Config (only the main changes are listed)
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run knowledge_distillation_single_device --config qwen2/1.5_to_0.5B_KD_lora_single_device
INFO:torchtune.utils._logging:Running KDRecipeSingleDevice with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Qwen2-0.5B-Instruct
checkpoint_files:
- model.safetensors
model_type: QWEN2
output_dir: /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device
recipe_checkpoint: null
compile: false
dataset:
_component_: torchtune.datasets.alpaca_cleaned_dataset
packed: false
source: /tmp/dataset/alpaca_data_cleaned
device: npu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 8
kd_loss:
_component_: torchtune.modules.loss.ForwardKLWithChunkedOutputLoss
kd_ratio: 0.5
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
_component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
num_warmup_steps: 100
max_steps_per_epoch: 200
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/logs
model:
_component_: torchtune.models.qwen2.lora_qwen2_0_5b
apply_lora_to_mlp: true
lora_alpha: 64
lora_attn_modules:
- q_proj
- v_proj
- output_proj
lora_rank: 32
optimizer:
_component_: torch.optim.AdamW
lr: 0.0003
weight_decay: 0.01
output_dir: /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
teacher_checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Qwen2-1.5B-Instruct
checkpoint_files:
- model.safetensors
model_type: QWEN2
output_dir: /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device
recipe_checkpoint: null
teacher_model:
_component_: torchtune.models.qwen2.qwen2_1_5b
tokenizer:
_component_: torchtune.models.qwen2.qwen2_tokenizer
max_seq_len: null
merges_file: /tmp/Qwen2-0.5B-Instruct/merges.txt
path: /tmp/Qwen2-0.5B-Instruct/vocab.json
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 535418630. Local seed is seed + rank = 535418630 + 0
Writing logs to /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/logs/log_1736149296.txt
/home/anaconda3/envs/torchtune_npu/lib/python3.10/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
INFO:torchtune.utils._logging:Student model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Memory stats after student model init:
NPU peak memory allocation: 1.37 GiB
NPU peak memory reserved: 1.39 GiB
NPU peak memory active: 1.37 GiB
INFO:torchtune.utils._logging:Teacher model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Memory stats after teacher model init:
NPU peak memory allocation: 5.20 GiB
NPU peak memory reserved: 5.22 GiB
NPU peak memory active: 5.20 GiB
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:Optimizer and loss are initialized.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:Learning rate scheduler is initialized.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|200|Loss: 1.4175333976745605: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [15:34<00:00, 4.69s/it]INFO:torchtune.utils._logging:Model checkpoint of size 0.92 GiB saved to /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.03 GiB saved to /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/epoch_0/adapter_model.pt
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.03 GiB saved to /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/epoch_0/adapter_model.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.00 GiB saved to /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/epoch_0/adapter_config.json
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|200|Loss: 1.4175333976745605: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [15:54<00:00, 4.77s/it]
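The kd_ratio: 0.5 setting above blends the hard-label cross-entropy loss with the forward-KL distillation loss against the teacher. To my understanding this is a simple convex combination; a sketch:

```python
def kd_combined_loss(ce_loss, kd_loss, kd_ratio=0.5):
    # Convex combination of the hard-label loss (cross-entropy against
    # the dataset) and the distillation loss (forward KL against the
    # teacher). kd_ratio: 0.5 weighs the two terms equally.
    return (1.0 - kd_ratio) * ce_loss + kd_ratio * kd_loss
```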
Recipe: lora_dpo_single_device
- Model: Llama-2-7b-hf
- Config (only the main changes are listed)
optimizer:
_component_: torch.optim.AdamW
fused: False # the fused optimizer kernel is not supported on the Ascend NPU
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run lora_dpo_single_device --config llama2/7B_lora_dpo_single_device
INFO:torchtune.utils._logging:Running LoRADPORecipeSingleDevice with resolved config:
batch_size: 4
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
adapter_checkpoint: null
checkpoint_dir: /tmp/llama2-7b-hf
checkpoint_files:
- pytorch_model-00001-of-00002.bin
- pytorch_model-00002-of-00002.bin
model_type: LLAMA2
output_dir: /tmp/torchtune/llama2_7B/lora_dpo_single_device
recipe_checkpoint: null
compile: false
dataset:
_component_: torchtune.datasets.stack_exchange_paired_dataset
data_files: /tmp/stack-exchange-paired/data/rl/merged_rl.csv
source: csv
split: train[:10%]
device: npu
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.rlhf.loss.DPOLoss
lr_scheduler:
_component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
num_warmup_steps: 10
max_steps_per_epoch: 100
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: /tmp/torchtune/llama2_7B/lora_dpo_single_device/logs
model:
_component_: torchtune.models.llama2.lora_llama2_7b
apply_lora_to_mlp: true
apply_lora_to_output: false
lora_alpha: 16
lora_attn_modules:
- q_proj
- v_proj
- output_proj
lora_dropout: 0.0
lora_rank: 8
optimizer:
_component_: torch.optim.AdamW
fused: false
lr: 0.0005
weight_decay: 0.05
output_dir: /tmp/torchtune/llama2_7B/lora_dpo_single_device
resume_from_checkpoint: false
save_adapter_weights_only: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
max_seq_len: 1024
path: /tmp/llama2-7b-hf/tokenizer.model
INFO:torchtune.utils._logging:Hint: enable_activation_checkpointing is True, but enable_activation_offloading isn't. Enabling activation offloading should reduce memory further.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 3863624611. Local seed is seed + rank = 3863624611 + 0
Writing logs to /tmp/torchtune/llama2_7B/lora_dpo_single_device/logs/log_1736150859.txt
/home/anaconda3/envs/torchtune_npu/lib/python3.10/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Memory stats after model init:
NPU peak memory allocation: 13.03 GiB
NPU peak memory reserved: 13.04 GiB
NPU peak memory active: 13.03 GiB
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:Optimizer and loss are initialized.
INFO:torchtune.utils._logging:Loss function is initialized.
Generating train split: 7435908 examples [05:25, 22865.95 examples/s]
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:Learning rate scheduler is initialized.
1|100|Loss: 0.5470260381698608: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [07:33<00:00, 4.23s/it]INFO:torchtune.utils._logging:Model checkpoint of size 9.29 GiB saved to /tmp/torchtune/llama2_7B/lora_dpo_single_device/epoch_0/ft-model-00001-of-00002.safetensors
INFO:torchtune.utils._logging:Model checkpoint of size 3.26 GiB saved to /tmp/torchtune/llama2_7B/lora_dpo_single_device/epoch_0/ft-model-00002-of-00002.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.03 GiB saved to /tmp/torchtune/llama2_7B/lora_dpo_single_device/epoch_0/adapter_model.pt
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.03 GiB saved to /tmp/torchtune/llama2_7B/lora_dpo_single_device/epoch_0/adapter_model.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.00 GiB saved to /tmp/torchtune/llama2_7B/lora_dpo_single_device/epoch_0/adapter_config.json
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|100|Loss: 0.5470260381698608: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [10:21<00:00, 6.21s/it]
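For context on the DPOLoss used above, the standard DPO objective scores the chosen/rejected log-prob margin of the policy relative to a frozen reference model. A plain-Python sketch (the real loss operates on batched log-probs; beta shown here is the usual default, not necessarily this config's value):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO objective (sketch): the implicit reward margin is the policy's
    # chosen-vs-rejected log-prob margin minus the reference model's,
    # scaled by beta and passed through -log(sigmoid(.)).
    logits = beta * ((policy_chosen - policy_rejected) - (ref_chosen - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```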
Recipe: lora_finetune_single_device
- Model: Llama-3.2-1B-Instruct
- Config (only the main changes are listed)
optimizer:
_component_: torch.optim.AdamW
fused: False # the fused optimizer kernel is not supported on the Ascend NPU
device: npu
dtype: fp32
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run lora_finetune_single_device --config llama3_2/1B_lora_single_device
INFO:torchtune.utils._logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:
batch_size: 4
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: /tmp/torchtune/llama3_2_1B/lora_single_device
recipe_checkpoint: null
compile: false
dataset:
_component_: torchtune.datasets.alpaca_cleaned_dataset
packed: false
source: /tmp/alpaca_data_cleaned
device: npu
dtype: fp32
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 2
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
_component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
num_warmup_steps: 10
max_steps_per_epoch: 100
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: /tmp/torchtune/llama3_2_1B/lora_single_device/logs
model:
_component_: torchtune.models.llama3_2.lora_llama3_2_1b
apply_lora_to_mlp: true
lora_alpha: 128
lora_attn_modules:
- q_proj
- v_proj
- output_proj
lora_dropout: 0.0
lora_rank: 64
optimizer:
_component_: torch.optim.AdamW
fused: false
lr: 0.0003
weight_decay: 0.01
output_dir: /tmp/torchtune/llama3_2_1B/lora_single_device
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: /tmp/torchtune/llama3_2_1B/lora_single_device/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
save_adapter_weights_only: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1697766874. Local seed is seed + rank = 1697766874 + 0
Writing logs to /tmp/torchtune/llama3_2_1B/lora_single_device/logs/log_1736154976.txt
INFO:torchtune.utils._logging:Model is initialized with precision torch.float32.
INFO:torchtune.utils._logging:Memory stats after model init:
NPU peak memory allocation: 4.79 GiB
NPU peak memory reserved: 4.81 GiB
NPU peak memory active: 4.79 GiB
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:Optimizer and loss are initialized.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:Learning rate scheduler is initialized.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|100|Loss: 0.9386640191078186: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:58<00:00, 1.61it/s]INFO:torchtune.utils._logging:Starting checkpoint save...
INFO:torchtune.utils._logging:Model checkpoint of size 4.60 GiB saved to /tmp/torchtune/llama3_2_1B/lora_single_device/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.16 GiB saved to /tmp/torchtune/llama3_2_1B/lora_single_device/epoch_0/adapter_model.pt
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.16 GiB saved to /tmp/torchtune/llama3_2_1B/lora_single_device/epoch_0/adapter_model.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.00 GiB saved to /tmp/torchtune/llama3_2_1B/lora_single_device/epoch_0/adapter_config.json
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
INFO:torchtune.utils._logging:Checkpoint saved in 98.77 seconds.
1|100|Loss: 0.9386640191078186: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [02:36<00:00, 1.57s/it]
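Both LoRA configs above disable fused AdamW because, per the config comment, the fused kernel is not supported on the Ascend NPU. A small hypothetical helper showing how optimizer kwargs could be made device-conditional (conservatively enabling fused only on CUDA):

```python
def adamw_kwargs(device_type, lr=3e-4, weight_decay=0.01):
    # Hypothetical helper: enable the fused AdamW kernel only on CUDA;
    # on "npu" or "cpu" fall back to the default implementation by
    # passing fused=False.
    return {"lr": lr, "weight_decay": weight_decay,
            "fused": device_type == "cuda"}
```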
Recipe: quantize
- Model: Llama-2-7b-hf
- Config (only the main changes are listed)
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run quantize --config quantization
INFO:torchtune.utils._logging:Running QuantizationRecipe with resolved config:
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/llama2-7b-hf
checkpoint_files:
- pytorch_model-00001-of-00002.bin
- pytorch_model-00002-of-00002.bin
model_type: LLAMA2
output_dir: /tmp/torchtune/llama2_7B/quantized
recipe_checkpoint: null
device: npu
dtype: bf16
model:
_component_: torchtune.models.llama2.llama2_7b
output_dir: /tmp/torchtune/llama2_7B/quantized
quantizer:
_component_: torchtune.training.quantization.Int8DynActInt4WeightQuantizer
groupsize: 256
seed: 1234
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1234. Local seed is seed + rank = 1234 + 0
/home/anaconda3/envs/torchtune_npu/lib/python3.10/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Time for quantization: 0.52 sec
INFO:torchtune.utils._logging:Memory used: 13.95 GB
INFO:torchtune.utils._logging:Model checkpoint of size 6.49 GiB saved to /tmp/torchtune/llama2_7B/quantized/pytorch_model-00001-of-00002-8da4w.pt
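The Int8DynActInt4WeightQuantizer config uses groupsize: 256, i.e. each group of 256 weights shares one quantization scale. A toy sketch of symmetric group-wise int4 weight quantization (illustrative only, not torchao's implementation):

```python
def quantize_group(weights, n_bits=4):
    # Toy symmetric group-wise quantization: one scale per group, values
    # rounded to signed integers in [-(2^(n-1)-1), 2^(n-1)-1].
    qmax = 2 ** (n_bits - 1) - 1  # 7 for int4
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero group
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale
```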
Feel free to share any suggestions for improvement! ☺️
Hi @RdoubleA, @joecummings, @ebsmothers:
Could you please help to review this PR and give me some advice? Thank you for your time! 😄
This is very helpful to me. Nice work!
Thanks @Nicorgi for the PR! Please give us 1-2 days as we catch up from the holiday backlog, we will review this soon!
Hi @RdoubleA, @joecummings, @ebsmothers:
Could you take some time to review my code? Thanks a lot. 😄
Fantastic work. I'd like to ask whether the Ascend NPU can be directly compatible with PyTorch.
Hi @dz1iang, yes: you can first pip install torch torch_npu and then import both modules in your code, as shown below.
import torch
import torch_npu  # importing torch_npu registers the "npu" device type with PyTorch
For more details, you can refer to our docs. Hope this can solve your problem. 🤗