Cannot switch back to CPU training after doing TPU training
System Info
- `Accelerate` version: 0.10.0
- Platform: Linux-5.13.0-1023-gcp-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.22.4
- PyTorch version (GPU?): 1.11.0a0+gitbc2c6ed (False)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: TPU
- mixed_precision: no
- use_cpu: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
Reproduction
First, configure Accelerate to do TPU training
$ pipenv run accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 3
What is the name of the function in your script that should be launched in all parallel scripts? [main]:
How many TPU cores should be used for distributed training? [1]:
Next, run the example script
$ pipenv run accelerate launch accelerate/examples/nlp_example.py
... output omitted ...
Then, configure Accelerate to do CPU training
$ pipenv run accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 1
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
How many CPU(s) should be used for distributed training? [1]:96
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: NO
Finally, run the example script again
$ pipenv run accelerate launch accelerate/examples/nlp_example.py
... output omitted ...
Expected behavior
I expected the run to use only CPUs. However, even after I configured CPU training, the last run still outputs something like the following:
Reusing dataset cifar100 (/home/qys/.cache/huggingface/datasets/cifar100/cifar100/1.0.0/f365c8b725c23e8f0f8d725c3641234d9331cd2f62919d1381d1baa5b3ba3142)
Loading cached processed dataset at /home/qys/.cache/huggingface/datasets/cifar100/cifar100/1.0.0/f365c8b725c23e8f0f8d725c3641234d9331cd2f62919d1381d1baa5b3ba3142/cache-edd23acaf2e749df.arrow
2022-06-18 14:15:25.376688: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-18 14:15:25.376774: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
Clearly, it is still using the TPU. How can I reconfigure Accelerate to use CPUs only?
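For reference, a minimal way to check which device Accelerate resolves from the saved config (a sketch; the values noted in the comments just mirror the `accelerate test` output further down):
```python
# Minimal sketch: inspect the device/backend Accelerate picks up from the default config.
from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.device)            # reports xla:1 here instead of a CPU device
print(accelerator.distributed_type)  # reports TPU, matching the `accelerate test` output below
```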
Here is some extra information you may find useful.
$ cat ~/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 96
use_cpu: true
$ pipenv run accelerate test
Running: accelerate-launch --config_file=None /home/qys/Research/embedder/.venv/lib/python3.8/site-packages/accelerate/test_utils/test_script.py
stderr: 2022-06-18 14:49:10.634148: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
stderr: 2022-06-18 14:49:10.634203: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: TPU
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: xla:1
stdout:
stdout:
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout:
stdout: **DataLoader integration test**
stdout: 0 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]) <class 'torch_xla.distributed.parallel_loader.MpDeviceLoader'>
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout: Non-shuffled central dataloader passing.
stdout: Shuffled central dataloader passing.
stdout:
stdout: **Training integration test**
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Legacy FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Test is a success! You are ready for your distributed training!
@nalzok it is actually performing CPU training; however, upon importing torch it still warms up the TPU regardless, since it acts as a hook. That is why you see this.
To prove this, I added the following code to the training function:
def training_function(config, args):
    # Initialize accelerator
    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
    print(f'DEVICE: {accelerator.device}')  # Added what we train on
    print(f'NUM_PROCESSES: {accelerator.num_processes}')  # Added the number of processes
And launched it via:
accelerate launch --num_processes 1 accelerate/examples/nlp_example.py --cpu
And it printed out the right information about what it was being trained on.
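The same override can also be applied in any script, independent of the saved config, by passing `cpu=True` directly (a minimal sketch; this is the same parameter the example script wires to its `--cpu` flag):
```python
# Minimal sketch: force CPU regardless of what the saved accelerate config selects.
from accelerate import Accelerator

accelerator = Accelerator(cpu=True)
print(f'DEVICE: {accelerator.device}')              # should report a CPU device
print(f'NUM_PROCESSES: {accelerator.num_processes}')
```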
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I think it would be better if we could avoid loading the TPU library altogether because, from my understanding, "warming up" the TPU makes it unusable to other processes.
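To illustrate the concern (a rough sketch, assuming `torch_xla` is installed as on this TPU VM): touching the XLA device at all is enough to claim the TPU for the current process, even if the training tensors stay on CPU.
```python
# Rough sketch of the concern above: requesting the XLA device initializes the
# TPU runtime for this process even if nothing is ever trained on it.
import importlib.util

if importlib.util.find_spec("torch_xla") is not None:
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()  # this is the "warm up" step that claims the TPU
    print(device)             # e.g. xla:1, as in the test output earlier in the thread
else:
    print("torch_xla is not installed; nothing to warm up")
```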