
Tensorboard unable to capture profile for jax example

Open cfRod opened this issue 11 months ago • 1 comments

To report a problem with TensorBoard itself, please fill out the remainder of this template.

Environment information (required)

Please run diagnose_tensorboard.py (link below) in the same environment from which you normally run TensorFlow/TensorBoard, and paste the output here:

/JAX/xla/xla/service/cpu/benchmarks/e2e/gemma2/keras$ python diagnose_tensorboard.py

Diagnostics

Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version c6ca9f1d004e2a1bc7c160abc43be229b82cad7e

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='ip-10-252-30-225', release='6.8.0-1021-aws', version='#23~22.04.1-Ubuntu SMP Tue Dec 10 16:50:46 UTC 2024', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: '/home/../venv/gemma2-keras'

--- check: installed_packages
INFO: installed: tensorboard==2.18.0
INFO: installed: tensorflow==2.18.0
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.7.2

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.18.0'

--- check: tensorflow_python_version
2025-02-17 17:43:29.606821: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-17 17:43:29.616812: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1739814209.629483    7716 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1739814209.632905    7716 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-17 17:43:29.644947: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO: tensorflow.__version__: '2.18.0'
INFO: tensorflow.__git_version__: 'v2.18.0-rc2-4-g6550e4bd802'

--- check: tensorboard_data_server_version
INFO: data server binary: '/home/.../venv/gemma2-keras/lib/python3.10/site-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.7.2'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/home/../venv/gemma2-keras/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::1', 0, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'ip-10-252-30-225.eu-west-1.compute.internal'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=8278350, st_dev=66305, st_nlink=2, st_uid=1007, st_gid=1008, st_size=4096, st_atime=1739813704, st_mtime=1739814201, st_ctime=1739814201)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/home/.../venv/gemma2-keras/lib/python3.10/site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==2.1.0
astunparse==1.6.3
certifi==2024.12.14
charset-normalizer==3.4.1
etils==1.12.0
filelock==3.16.1
flatbuffers==24.12.23
fsspec==2024.12.0
gast==0.6.0
google-pasta==0.2.0
grpcio==1.69.0
gviz-api==1.10.0
h5py==3.12.1
idna==3.10
importlib_resources==6.5.2
jax==0.4.38
jaxlib==0.4.38
Jinja2==3.1.5
kagglehub==0.3.6
keras==3.8.0
keras-hub==0.18.1
keras-nlp==0.18.1
libclang==18.1.1
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
mdurl==0.1.2
ml-dtypes==0.4.1
mpmath==1.3.0
namex==0.0.8
networkx==3.4.2
numpy==2.0.2
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
opt_einsum==3.4.0
optree==0.13.1
packaging==24.2
pip==22.0.2
protobuf==4.25.6
Pygments==2.19.1
regex==2024.11.6
requests==2.32.3
rich==13.9.4
scipy==1.15.0
setuptools==59.6.0
six==1.17.0
sympy==1.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
tensorboard-plugin-profile==2.19.0
tensorflow==2.18.0
tensorflow-io-gcs-filesystem==0.37.1
tensorflow-text==2.18.1
termcolor==2.5.0
torch==2.5.1
tqdm==4.67.1
triton==3.1.0
typing_extensions==4.12.2
urllib3==2.3.0
Werkzeug==3.1.3
wheel==0.45.1
wrapt==1.17.0
zipp==3.21.0

Next steps

No action items identified. Please copy ALL of the above output, including the lines containing only backticks, into your GitHub issue or comment. Be sure to redact any sensitive information.

Issue description

I am running the example from https://docs.jax.dev/en/latest/profiling.html on the CPU:

import jax

jax.profiler.start_trace("/tmp/tensorboard")

# Run the operations to be profiled
key = jax.random.key(0)
x = jax.random.normal(key, (5000, 5000))
y = x @ x
y.block_until_ready()

jax.profiler.stop_trace()

However, TensorBoard shows no captured trace for this default example:

Image

cfRod, Feb 17 '25 17:02

Hi @cfRod,

Unfortunately, only a few tools in TensorBoard support XLA:CPU profiling right now: the trace viewer and the graph viewer.

To see the results, select the trace viewer tool from the drop-down.

Image

You can go to the graph viewer from the timeline by clicking on the HLO op you are interested in. A link to the graph appears in the bottom-right panel of the page.

Image

Example graph viewer screen:

Image

The framework op stats tool sometimes works, but it didn't in this case. We hope to fix this in the future, though I don't have a timeline yet.
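
If the trace viewer also appears empty, one quick check is whether the profiler wrote anything to disk at all. The following is a minimal sketch, assuming traces land under <log_dir>/plugins/profile/<run>/ (a layout the profile plugin commonly reads, but verify it for your versions). find_profile_files is a hypothetical helper, not part of JAX or TensorBoard.

```python
from pathlib import Path


def find_profile_files(log_dir):
    """List files found under <log_dir>/plugins/profile/<run>/.

    Assumption: the profiler stores each capture in its own run
    subdirectory below plugins/profile. This layout is not confirmed
    by the thread above.
    """
    profile_root = Path(log_dir) / "plugins" / "profile"
    if not profile_root.is_dir():
        # No profile directory at all: nothing was captured here.
        return []
    # One level of run directories, then the files inside each run.
    return sorted(p for p in profile_root.glob("*/*") if p.is_file())
```

Calling find_profile_files("/tmp/tensorboard") after jax.profiler.stop_trace() should return a non-empty list if the capture succeeded. An empty list suggests the problem is in the capture step rather than in the TensorBoard UI.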

penpornk, Feb 18 '25 22:02