Add Ascend NPU as a backend for single device recipes
Context
What is the purpose of this PR? Is it to
- [x] add a new feature
- [ ] fix a bug
- [ ] update tests and/or documentation
- [ ] other (please add here)
Changelog
What are the changes made in this PR?
- This PR adds Ascend NPU as a backend for eight single-device recipes: eleuther_eval, full_finetune_single_device, generate, dev/generate_v2, knowledge_distillation_single_device, lora_dpo_single_device, lora_finetune_single_device, and quantize.
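For reference, the core idea is that a recipe can request the NPU via its config (device: npu) and fall back to CPU when the backend is absent. Below is a minimal, hypothetical sketch of that fallback logic (not torchtune's actual device-resolution code; resolve_device here is an illustrative helper):

```python
def resolve_device(requested: str = "npu") -> str:
    # Hypothetical sketch: prefer the requested accelerator and fall
    # back to CPU when its backend is unavailable. Importing torch_npu
    # registers the "npu" device type with PyTorch.
    try:
        import torch
        import torch_npu  # noqa: F401  (registers the "npu" backend)
        if torch.npu.is_available():
            return requested
    except (ImportError, AttributeError):
        pass
    return "cpu"
```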
Environment
We conducted basic usage tests in the following environment.
- OS: Ubuntu 20.04
- NPU: Atlas 800T A2
- CANN: 8.0.RC3
- torch-npu: 2.5.1 rc1
- torch: 2.5.1
Recipe: eleuther_eval
- Model: Llama-3.2-1B-Instruct
- Config (only the main changes are listed)
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run eleuther_eval --config llama3_2/evaluation
INFO:torchtune.utils._logging:Running EleutherEvalRecipe with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: /tmp/Llama-3.2-1B-Instruct
recipe_checkpoint: null
device: npu
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
_component_: torchtune.models.llama3_2.llama3_2_1b
quantizer: null
resume_from_checkpoint: false
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Running evaluation on the following tasks: ['truthfulqa_mc2']
INFO:lm-eval:Building contexts for truthfulqa_mc2 on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:02<00:00, 399.47it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5882/5882 [00:44<00:00, 133.35it/s]
INFO:torchtune.utils._logging:Eval completed in 51.04 seconds.
INFO:torchtune.utils._logging:Max memory allocated: 12.21 GB
INFO:torchtune.utils._logging:
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |↑ |0.4393|± |0.0144|
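As a simplified illustration of how loglikelihood-based multiple-choice tasks are scored (truthfulqa_mc2 itself uses a probability-weighted variant, so this is only a sketch, not lm-eval's implementation):

```python
def mc_accuracy(option_loglikelihoods, gold_indices):
    # Simplified multiple-choice scoring: the predicted answer is the
    # option with the highest model log-likelihood; accuracy is the
    # fraction of questions where that matches the gold index.
    correct = 0
    for lls, gold in zip(option_loglikelihoods, gold_indices):
        pred = max(range(len(lls)), key=lambda i: lls[i])
        if pred == gold:
            correct += 1
    return correct / len(option_loglikelihoods)
```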
Recipe: full_finetune_single_device
- Model: Llama-3.2-1B-Instruct
- Config (only the main changes are listed)
optimizer:
_component_: torch.optim.AdamW # change the optimizer
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run full_finetune_single_device --config llama3_2/1B_full_single_device
INFO:torchtune.utils._logging:Running FullFinetuneRecipeSingleDevice with resolved config:
batch_size: 4
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: /tmp/torchtune/llama3_2_1B/full_single_device
recipe_checkpoint: null
compile: false
dataset:
_component_: torchtune.datasets.alpaca_dataset
packed: false
source: /tmp/dataset/alpaca_data
device: npu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: 100
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: /tmp/torchtune/llama3_2_1B/full_single_device/logs
model:
_component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
_component_: torch.optim.AdamW
lr: 2.0e-05
optimizer_in_bwd: true
output_dir: /tmp/torchtune/llama3_2_1B/full_single_device
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: /tmp/torchtune/llama3_2_1B/full_single_device/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 4147997196. Local seed is seed + rank = 4147997196 + 0
Writing logs to /tmp/torchtune/llama3_2_1B/full_single_device/logs/log_1736148110.txt
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Memory stats after model init:
NPU peak memory allocation: 3.31 GiB
NPU peak memory reserved: 3.32 GiB
NPU peak memory active: 3.31 GiB
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:In-backward optimizers are set up.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|100|Loss: 1.2042158842086792: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:27<00:00, 3.89it/s]INFO:torchtune.utils._logging:Model checkpoint of size 2.30 GiB saved to /tmp/torchtune/llama3_2_1B/full_single_device/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|100|Loss: 1.2042158842086792: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:31<00:00, 3.17it/s]
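The config above sets optimizer_in_bwd: true, which applies each parameter's update during the backward pass instead of in a separate step(). A toy sketch of the idea (InBackwardSGD is an illustrative class, not torchtune's implementation):

```python
class InBackwardSGD:
    # Toy illustration of the optimizer_in_bwd idea: each parameter is
    # updated as soon as its gradient is produced during the backward
    # pass, so gradients never need to be stored until a later step().
    def __init__(self, lr=2e-5):
        self.lr = lr

    def on_grad_ready(self, param_value, grad):
        # Apply the update immediately; the gradient can then be freed,
        # reducing peak memory.
        return param_value - self.lr * grad
```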
Recipe: generate
- Model: Llama-3.2-1B-Instruct
- Config (only the main changes are listed)
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run generate --config generation
INFO:torchtune.utils._logging:Running InferenceRecipe with resolved config:
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: ./
device: npu
dtype: bf16
enable_kv_cache: true
max_new_tokens: 300
model:
_component_: torchtune.models.llama3_2.llama3_2_1b
output_dir: ./
prompt:
system: null
user: Tell me a joke.
quantizer: null
seed: 1234
temperature: 0.6
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
prompt_template: null
top_k: 300
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1234. Local seed is seed + rank = 1234 + 0
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Tell me a joke.A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?"
The librarian replied, "It rings a bell, but I'm not sure if it's here or not."
INFO:torchtune.utils._logging:Time for inference: 3.94 sec total, 13.71 tokens/sec
INFO:torchtune.utils._logging:Bandwidth achieved: 34.48 GB/s
INFO:torchtune.utils._logging:Memory used: 3.56 GB
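The temperature: 0.6 and top_k: 300 settings above control sampling. A minimal sketch of what these knobs do (toy code with a small top_k default, not the recipe's actual sampler):

```python
import math
import random

def sample_next(logits, temperature=0.6, top_k=3, rng=random):
    # Keep only the top_k highest logits, soften them with the
    # temperature, then sample an index from the renormalized
    # distribution. Lower temperature -> greedier sampling.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```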
Recipe: dev/generate_v2
- Model: Llama-2-7b-hf
- Config (only the main changes are listed)
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run dev/generate_v2 --config llama2/generation_v2
INFO:torchtune.utils._logging:Running InferenceRecipe with resolved config:
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/llama2-7b-hf
checkpoint_files:
- pytorch_model-00001-of-00002.bin
- pytorch_model-00002-of-00002.bin
model_type: LLAMA2
output_dir: ./
device: npu
dtype: bf16
log_level: INFO
max_new_tokens: 200
model:
_component_: torchtune.models.llama2.llama2_7b
output_dir: ./
prompt:
system: You are a helpful and creative AI assistant.
user: What is the capital of France?
seed: 1234
temperature: 0.6
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
max_seq_len: 2048
path: /tmp/llama2-7b-hf/tokenizer.model
top_k: 300
/home/anaconda3/envs/torchtune_npu/lib/python3.10/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
INFO:torchtune.utils._logging:Model was initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:
</INST>
<</SYS>>
You are a helpful and creative AI assistant.
<</SYS>>
What is the capital of France? [/INST]
</INST>
<</SYS>>
You are a helpful and creative AI assistant.
<</SYS>>
What is the capital of France? [/INST]
</INST>
<</SYS>>
You are a helpful and creative AI assistant.
<</SYS>>
What is the capital of France? [/INST]
</INST>
<</SYS>>
You are a helpful and creative AI assistant.
<</SYS>>
What is the capital of France? [/INST]
</INST>
<</SYS>>
You are a helpful and creative AI assistant.
INFO:torchtune.utils._logging:Time for inference: 17.30 sec total, 11.62 tokens/sec
INFO:torchtune.utils._logging:Bandwidth achieved: 158.80 GB/s
INFO:torchtune.utils._logging:Max memory allocated: 13.95 GB
Recipe: knowledge_distillation_single_device
- Model: Qwen2-0.5B-Instruct, Qwen2-1.5B-Instruct
- Config (only the main changes are listed)
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run knowledge_distillation_single_device --config qwen2/1.5_to_0.5B_KD_lora_single_device
INFO:torchtune.utils._logging:Running KDRecipeSingleDevice with resolved config:
batch_size: 8
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Qwen2-0.5B-Instruct
checkpoint_files:
- model.safetensors
model_type: QWEN2
output_dir: /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device
recipe_checkpoint: null
compile: false
dataset:
_component_: torchtune.datasets.alpaca_cleaned_dataset
packed: false
source: /tmp/dataset/alpaca_data_cleaned
device: npu
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 8
kd_loss:
_component_: torchtune.modules.loss.ForwardKLWithChunkedOutputLoss
kd_ratio: 0.5
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
_component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
num_warmup_steps: 100
max_steps_per_epoch: 200
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/logs
model:
_component_: torchtune.models.qwen2.lora_qwen2_0_5b
apply_lora_to_mlp: true
lora_alpha: 64
lora_attn_modules:
- q_proj
- v_proj
- output_proj
lora_rank: 32
optimizer:
_component_: torch.optim.AdamW
lr: 0.0003
weight_decay: 0.01
output_dir: /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
teacher_checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Qwen2-1.5B-Instruct
checkpoint_files:
- model.safetensors
model_type: QWEN2
output_dir: /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device
recipe_checkpoint: null
teacher_model:
_component_: torchtune.models.qwen2.qwen2_1_5b
tokenizer:
_component_: torchtune.models.qwen2.qwen2_tokenizer
max_seq_len: null
merges_file: /tmp/Qwen2-0.5B-Instruct/merges.txt
path: /tmp/Qwen2-0.5B-Instruct/vocab.json
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 535418630. Local seed is seed + rank = 535418630 + 0
Writing logs to /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/logs/log_1736149296.txt
/home/anaconda3/envs/torchtune_npu/lib/python3.10/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
INFO:torchtune.utils._logging:Student model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Memory stats after student model init:
NPU peak memory allocation: 1.37 GiB
NPU peak memory reserved: 1.39 GiB
NPU peak memory active: 1.37 GiB
INFO:torchtune.utils._logging:Teacher model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Memory stats after teacher model init:
NPU peak memory allocation: 5.20 GiB
NPU peak memory reserved: 5.22 GiB
NPU peak memory active: 5.20 GiB
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:Optimizer and loss are initialized.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:Learning rate scheduler is initialized.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|200|Loss: 1.4175333976745605: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [15:34<00:00, 4.69s/it]INFO:torchtune.utils._logging:Model checkpoint of size 0.92 GiB saved to /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.03 GiB saved to /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/epoch_0/adapter_model.pt
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.03 GiB saved to /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/epoch_0/adapter_model.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.00 GiB saved to /tmp/torchtune/qwen2_1_5_to_0_5B/KD_lora_single_device/epoch_0/adapter_config.json
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|200|Loss: 1.4175333976745605: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [15:54<00:00, 4.77s/it]
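The kd_ratio: 0.5 setting above blends the hard-label cross-entropy loss with the forward-KL distillation loss against the teacher. To my understanding this is a simple convex combination; a sketch:

```python
def kd_combined_loss(ce_loss, kd_loss, kd_ratio=0.5):
    # Convex combination of the hard-label loss (cross-entropy against
    # the dataset) and the distillation loss (forward KL against the
    # teacher). kd_ratio: 0.5 weighs the two terms equally.
    return (1.0 - kd_ratio) * ce_loss + kd_ratio * kd_loss
```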
Recipe: lora_dpo_single_device
- Model: Llama-2-7b-hf
- Config (only the main changes are listed)
optimizer:
_component_: torch.optim.AdamW
fused: False # the fused optimizer kernel is not supported on the Ascend NPU
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run lora_dpo_single_device --config llama2/7B_lora_dpo_single_device
INFO:torchtune.utils._logging:Running LoRADPORecipeSingleDevice with resolved config:
batch_size: 4
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
adapter_checkpoint: null
checkpoint_dir: /tmp/llama2-7b-hf
checkpoint_files:
- pytorch_model-00001-of-00002.bin
- pytorch_model-00002-of-00002.bin
model_type: LLAMA2
output_dir: /tmp/torchtune/llama2_7B/lora_dpo_single_device
recipe_checkpoint: null
compile: false
dataset:
_component_: torchtune.datasets.stack_exchange_paired_dataset
data_files: /tmp/stack-exchange-paired/data/rl/merged_rl.csv
source: csv
split: train[:10%]
device: npu
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.rlhf.loss.DPOLoss
lr_scheduler:
_component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
num_warmup_steps: 10
max_steps_per_epoch: 100
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: /tmp/torchtune/llama2_7B/lora_dpo_single_device/logs
model:
_component_: torchtune.models.llama2.lora_llama2_7b
apply_lora_to_mlp: true
apply_lora_to_output: false
lora_alpha: 16
lora_attn_modules:
- q_proj
- v_proj
- output_proj
lora_dropout: 0.0
lora_rank: 8
optimizer:
_component_: torch.optim.AdamW
fused: false
lr: 0.0005
weight_decay: 0.05
output_dir: /tmp/torchtune/llama2_7B/lora_dpo_single_device
resume_from_checkpoint: false
save_adapter_weights_only: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
max_seq_len: 1024
path: /tmp/llama2-7b-hf/tokenizer.model
INFO:torchtune.utils._logging:Hint: enable_activation_checkpointing is True, but enable_activation_offloading isn't. Enabling activation offloading should reduce memory further.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 3863624611. Local seed is seed + rank = 3863624611 + 0
Writing logs to /tmp/torchtune/llama2_7B/lora_dpo_single_device/logs/log_1736150859.txt
/home/anaconda3/envs/torchtune_npu/lib/python3.10/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Memory stats after model init:
NPU peak memory allocation: 13.03 GiB
NPU peak memory reserved: 13.04 GiB
NPU peak memory active: 13.03 GiB
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:Optimizer and loss are initialized.
INFO:torchtune.utils._logging:Loss function is initialized.
Generating train split: 7435908 examples [05:25, 22865.95 examples/s]
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:Learning rate scheduler is initialized.
1|100|Loss: 0.5470260381698608: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [07:33<00:00, 4.23s/it]INFO:torchtune.utils._logging:Model checkpoint of size 9.29 GiB saved to /tmp/torchtune/llama2_7B/lora_dpo_single_device/epoch_0/ft-model-00001-of-00002.safetensors
INFO:torchtune.utils._logging:Model checkpoint of size 3.26 GiB saved to /tmp/torchtune/llama2_7B/lora_dpo_single_device/epoch_0/ft-model-00002-of-00002.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.03 GiB saved to /tmp/torchtune/llama2_7B/lora_dpo_single_device/epoch_0/adapter_model.pt
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.03 GiB saved to /tmp/torchtune/llama2_7B/lora_dpo_single_device/epoch_0/adapter_model.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.00 GiB saved to /tmp/torchtune/llama2_7B/lora_dpo_single_device/epoch_0/adapter_config.json
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
1|100|Loss: 0.5470260381698608: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [10:21<00:00, 6.21s/it]
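For context on the DPOLoss used above, the standard DPO objective scores the chosen/rejected log-prob margin of the policy relative to a frozen reference model. A plain-Python sketch (the real loss operates on batched log-probs; beta shown here is the usual default, not necessarily this config's value):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO objective (sketch): the implicit reward margin is the policy's
    # chosen-vs-rejected log-prob margin minus the reference model's,
    # scaled by beta and passed through -log(sigmoid(.)).
    logits = beta * ((policy_chosen - policy_rejected) - (ref_chosen - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```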
Recipe: lora_finetune_single_device
- Model: Llama-3.2-1B-Instruct
- Config (only the main changes are listed)
optimizer:
_component_: torch.optim.AdamW
fused: False # the fused optimizer kernel is not supported on the Ascend NPU
device: npu
dtype: fp32
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run lora_finetune_single_device --config llama3_2/1B_lora_single_device
INFO:torchtune.utils._logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:
batch_size: 4
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Llama-3.2-1B-Instruct/
checkpoint_files:
- model.safetensors
model_type: LLAMA3_2
output_dir: /tmp/torchtune/llama3_2_1B/lora_single_device
recipe_checkpoint: null
compile: false
dataset:
_component_: torchtune.datasets.alpaca_cleaned_dataset
packed: false
source: /tmp/alpaca_data_cleaned
device: npu
dtype: fp32
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 2
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
_component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
_component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
num_warmup_steps: 10
max_steps_per_epoch: 100
metric_logger:
_component_: torchtune.training.metric_logging.DiskLogger
log_dir: /tmp/torchtune/llama3_2_1B/lora_single_device/logs
model:
_component_: torchtune.models.llama3_2.lora_llama3_2_1b
apply_lora_to_mlp: true
lora_alpha: 128
lora_attn_modules:
- q_proj
- v_proj
- output_proj
lora_dropout: 0.0
lora_rank: 64
optimizer:
_component_: torch.optim.AdamW
fused: false
lr: 0.0003
weight_decay: 0.01
output_dir: /tmp/torchtune/llama3_2_1B/lora_single_device
profiler:
_component_: torchtune.training.setup_torch_profiler
active_steps: 2
cpu: true
cuda: true
enabled: false
num_cycles: 1
output_dir: /tmp/torchtune/llama3_2_1B/lora_single_device/profiling_outputs
profile_memory: false
record_shapes: true
wait_steps: 5
warmup_steps: 3
with_flops: false
with_stack: false
resume_from_checkpoint: false
save_adapter_weights_only: false
seed: null
shuffle: true
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
max_seq_len: null
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1697766874. Local seed is seed + rank = 1697766874 + 0
Writing logs to /tmp/torchtune/llama3_2_1B/lora_single_device/logs/log_1736154976.txt
INFO:torchtune.utils._logging:Model is initialized with precision torch.float32.
INFO:torchtune.utils._logging:Memory stats after model init:
NPU peak memory allocation: 4.79 GiB
NPU peak memory reserved: 4.81 GiB
NPU peak memory active: 4.79 GiB
INFO:torchtune.utils._logging:Tokenizer is initialized from file.
INFO:torchtune.utils._logging:Optimizer and loss are initialized.
INFO:torchtune.utils._logging:Loss is initialized.
INFO:torchtune.utils._logging:Dataset and Sampler are initialized.
INFO:torchtune.utils._logging:Learning rate scheduler is initialized.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|100|Loss: 0.9386640191078186: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:58<00:00, 1.61it/s]INFO:torchtune.utils._logging:Starting checkpoint save...
INFO:torchtune.utils._logging:Model checkpoint of size 4.60 GiB saved to /tmp/torchtune/llama3_2_1B/lora_single_device/epoch_0/ft-model-00001-of-00001.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.16 GiB saved to /tmp/torchtune/llama3_2_1B/lora_single_device/epoch_0/adapter_model.pt
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.16 GiB saved to /tmp/torchtune/llama3_2_1B/lora_single_device/epoch_0/adapter_model.safetensors
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.00 GiB saved to /tmp/torchtune/llama3_2_1B/lora_single_device/epoch_0/adapter_config.json
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully.You can now use this checkpoint for further training or inference.
INFO:torchtune.utils._logging:Checkpoint saved in 98.77 seconds.
1|100|Loss: 0.9386640191078186: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [02:36<00:00, 1.57s/it]
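Both LoRA configs above disable fused AdamW because, per the config comment, the fused kernel is not supported on the Ascend NPU. A small hypothetical helper showing how optimizer kwargs could be made device-conditional (conservatively enabling fused only on CUDA):

```python
def adamw_kwargs(device_type, lr=3e-4, weight_decay=0.01):
    # Hypothetical helper: enable the fused AdamW kernel only on CUDA;
    # on "npu" or "cpu" fall back to the default implementation by
    # passing fused=False.
    return {"lr": lr, "weight_decay": weight_decay,
            "fused": device_type == "cuda"}
```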
Recipe: quantize
- Model: Llama-2-7b-hf
- Config (only the main changes are listed)
device: npu
- Logs
(torchtune_npu) [root@localhost torchtune]# tune run quantize --config quantization
INFO:torchtune.utils._logging:Running QuantizationRecipe with resolved config:
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/llama2-7b-hf
checkpoint_files:
- pytorch_model-00001-of-00002.bin
- pytorch_model-00002-of-00002.bin
model_type: LLAMA2
output_dir: /tmp/torchtune/llama2_7B/quantized
recipe_checkpoint: null
device: npu
dtype: bf16
model:
_component_: torchtune.models.llama2.llama2_7b
output_dir: /tmp/torchtune/llama2_7B/quantized
quantizer:
_component_: torchtune.training.quantization.Int8DynActInt4WeightQuantizer
groupsize: 256
seed: 1234
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 1234. Local seed is seed + rank = 1234 + 0
/home/anaconda3/envs/torchtune_npu/lib/python3.10/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
INFO:torchtune.utils._logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils._logging:Time for quantization: 0.52 sec
INFO:torchtune.utils._logging:Memory used: 13.95 GB
INFO:torchtune.utils._logging:Model checkpoint of size 6.49 GiB saved to /tmp/torchtune/llama2_7B/quantized/pytorch_model-00001-of-00002-8da4w.pt
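The Int8DynActInt4WeightQuantizer config uses groupsize: 256, i.e. each group of 256 weights shares one quantization scale. A toy sketch of symmetric group-wise int4 weight quantization (illustrative only, not torchao's implementation):

```python
def quantize_group(weights, n_bits=4):
    # Toy symmetric group-wise quantization: one scale per group, values
    # rounded to signed integers in [-(2^(n-1)-1), 2^(n-1)-1].
    qmax = 2 ** (n_bits - 1) - 1  # 7 for int4
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero group
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale
```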
Feel free to share any suggestions for improvement! ☺️
Hi @RdoubleA, @joecummings, @ebsmothers:
Could you please help to review this PR and give me some advice? Thank you for your time! 😄
This is very helpful to me. Nice work!
Thanks @Nicorgi for the PR! Please give us 1-2 days as we catch up from the holiday backlog, we will review this soon!
Hi @RdoubleA, @joecummings, @ebsmothers:
Could you take some time to review my code? Thanks a lot. 😄
Fantastic work. I'd like to ask whether the Ascend NPU can be directly compatible with PyTorch.
Hi @dz1iang, yes: you can first pip install torch torch_npu and then import both modules in your code, as shown below.
import torch
import torch_npu  # importing torch_npu registers the "npu" device type with PyTorch
For more details, you can refer to our docs. Hope this can solve your problem. 🤗