Cannot switch back to CPU training after doing TPU training
System Info
- `Accelerate` version: 0.10.0
- Platform: Linux-5.13.0-1023-gcp-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.22.4
- PyTorch version (GPU?): 1.11.0a0+gitbc2c6ed (False)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: TPU
- mixed_precision: no
- use_cpu: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
Reproduction
First, configure Accelerate to do TPU training
$ pipenv run accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 3
What is the name of the function in your script that should be launched in all parallel scripts? [main]:
How many TPU cores should be used for distributed training? [1]:
Next, run the example script
$ pipenv run accelerate launch accelerate/examples/nlp_example.py
... output omitted ...
Then, configure Accelerate to do CPU training
$ pipenv run accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 1
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
How many CPU(s) should be used for distributed training? [1]:96
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: NO
Finally, run the example script again
$ pipenv run accelerate launch accelerate/examples/nlp_example.py
... output omitted ...
Expected behavior
I expected the run to use only CPUs. However, even after I configured CPU training, the last run still outputs something like the following:
Reusing dataset cifar100 (/home/qys/.cache/huggingface/datasets/cifar100/cifar100/1.0.0/f365c8b725c23e8f0f8d725c3641234d9331cd2f62919d1381d1baa5b3ba3142)
Loading cached processed dataset at /home/qys/.cache/huggingface/datasets/cifar100/cifar100/1.0.0/f365c8b725c23e8f0f8d725c3641234d9331cd2f62919d1381d1baa5b3ba3142/cache-edd23acaf2e749df.arrow
2022-06-18 14:15:25.376688: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-18 14:15:25.376774: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
Clearly, it is still using the TPU. How can I reconfigure Accelerate to use CPUs only?
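For reference, a minimal way to check which device Accelerate resolves from the saved config (a sketch; the values noted in the comments just mirror the `accelerate test` output further down):
```python
# Minimal sketch: inspect the device/backend Accelerate picks up from the default config.
from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.device)            # reports xla:1 here instead of a CPU device
print(accelerator.distributed_type)  # reports TPU, matching the `accelerate test` output below
```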
Here is some extra information you may find useful.
$ cat ~/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 96
use_cpu: true
$ pipenv run accelerate test
Running: accelerate-launch --config_file=None /home/qys/Research/embedder/.venv/lib/python3.8/site-packages/accelerate/test_utils/test_script.py
stderr: 2022-06-18 14:49:10.634148: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
stderr: 2022-06-18 14:49:10.634203: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: TPU
stdout: Num processes: 1
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: xla:1
stdout:
stdout:
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout:
stdout: **DataLoader integration test**
stdout: 0 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]) <class 'torch_xla.distributed.parallel_loader.MpDeviceLoader'>
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout: Non-shuffled central dataloader passing.
stdout: Shuffled central dataloader passing.
stdout:
stdout: **Training integration test**
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Legacy FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Test is a success! You are ready for your distributed training!
@nalzok it is actually performing CPU training; however, upon importing torch it still warms up the TPU regardless, since it acts as a hook. That is why you see this.
To prove this, I added the following code to the training function:
def training_function(config, args):
    # Initialize accelerator
    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
    print(f'DEVICE: {accelerator.device}')  # Added what we train on
    print(f'NUM_PROCESSES: {accelerator.num_processes}')  # Added the number of processes
And launched it via:
accelerate launch --num_processes 1 accelerate/examples/nlp_example.py --cpu
And it printed out the right information about what it was being trained on.
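The same override can also be applied in any script, independent of the saved config, by passing `cpu=True` directly (a minimal sketch; this is the same parameter the example script wires to its `--cpu` flag):
```python
# Minimal sketch: force CPU regardless of what the saved accelerate config selects.
from accelerate import Accelerator

accelerator = Accelerator(cpu=True)
print(f'DEVICE: {accelerator.device}')              # should report a CPU device
print(f'NUM_PROCESSES: {accelerator.num_processes}')
```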
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I think it would be better if we could avoid loading the TPU library altogether because, from my understanding, "warming up" the TPU makes it unusable to other processes.
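To illustrate the concern (a rough sketch, assuming `torch_xla` is installed as on this TPU VM): touching the XLA device at all is enough to claim the TPU for the current process, even if the training tensors stay on CPU.
```python
# Rough sketch of the concern above: requesting the XLA device initializes the
# TPU runtime for this process even if nothing is ever trained on it.
import importlib.util

if importlib.util.find_spec("torch_xla") is not None:
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()  # this is the "warm up" step that claims the TPU
    print(device)             # e.g. xla:1, as in the test output earlier in the thread
else:
    print("torch_xla is not installed; nothing to warm up")
```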