Simplify device count external API calls

Open pgmoka opened this issue 11 months ago • 4 comments

Currently there are many external APIs for getting the number of devices associated with PyTorch XLA. Those that I could find were:

  • "global_runtime_device_count": returns the total number of devices across all processes/hosts, but it is decorated with "@functools.lru_cache()", so the result is cached.
  • "global_device_count": returns the total number of devices across all processes/hosts, but it is also decorated with "@functools.lru_cache()".
  • "addressable_runtime_device_count": returns the number of addressable devices visible to a process.
  • "addressable_device_count": returns the number of addressable devices visible to a process. It specifically returns 1 in the case of SPMD.
  • "local_device_count": takes the number of addressable devices and multiplies it by the local process count. This is effectively the number of devices running on a host.
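To make the relationships between these counts concrete, here is a toy model in plain Python. It does not call PyTorch/XLA at all: "ToyTopology" and its fields are hypothetical names made up for illustration, and the choice to multiply the SPMD-aware addressable count in "local_device_count" is an assumption, not something confirmed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTopology:
    """Hypothetical cluster layout; not a PyTorch/XLA type."""
    num_hosts: int
    processes_per_host: int
    devices_per_process: int
    spmd: bool = False

    def addressable_runtime_device_count(self) -> int:
        # Devices visible to one process, regardless of SPMD.
        return self.devices_per_process

    def addressable_device_count(self) -> int:
        # Same idea, except that it reports 1 under SPMD.
        return 1 if self.spmd else self.devices_per_process

    def local_device_count(self) -> int:
        # Addressable devices times local process count: devices on one host.
        # Assumption: the SPMD-aware count is the one being multiplied.
        return self.addressable_device_count() * self.processes_per_host

    def global_device_count(self) -> int:
        # Total devices across all processes and hosts.
        return (self.num_hosts * self.processes_per_host
                * self.devices_per_process)


# Two hosts, four processes per host, one device per process.
topo = ToyTopology(num_hosts=2, processes_per_host=4, devices_per_process=1)
print(topo.addressable_device_count(),
      topo.local_device_count(),
      topo.global_device_count())
```

In this layout the per-process count is 1, the per-host count is 4, and the global count is 8, which is the distinction the names are trying to capture.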

Some observations on these:

  • addressable_runtime_device_count and addressable_device_count are extremely similar in implementation and name. Perhaps we should make the distinction clearer. There may be some context around addressable_device_count in particular that I don't fully grasp.
  • The local_device_count terminology can be confusing when compared with JAX's concept of local devices in jax.local_devices: local_device_count is the number of devices on the host, while JAX's definition refers to the devices in the process.
  • We should deduplicate global_runtime_device_count and global_device_count by having one reference the other, which removes the duplicated calls.

pgmoka avatar May 19 '25 19:05 pgmoka

Related issues: #7653 #7657 #7658 cc @zpcore

ysiraichi avatar May 20 '25 12:05 ysiraichi

Related comment in https://github.com/pytorch/xla/pull/9184/files#r2115084512

pgmoka avatar May 30 '25 15:05 pgmoka

Is there a plan for how to consolidate the APIs? If so, maybe I can work on the implementation.

iwknow avatar Jun 01 '25 07:06 iwknow

I believe that would be part of the issue, and it would be a good way to familiarize yourself with all the different APIs. It would likely entail breaking down and illustrating all the relevant APIs [1][2][3][4], along with the suggested replacement and deprecation for each. I think this is needed first, to reach consensus here before the implementation.

rpsilva-aws avatar Jun 04 '25 05:06 rpsilva-aws