Ran Ran comments

Results: 16 comments of Ran Ran

Imagenet example does not work with newer flax versions

The training still hangs there with the latest change. Full logs: ``` 2023-10-11 01:27:34.284670: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No...

Imagenet example does not work with newer flax versions

I think this is related to checkpoint saving/restoring. The previous logs mention `Saving checkpoint at step: 1`, while during restoring I noticed both: * `Restoring orbax checkpoint...

Adding 2.12.1 tests to the dashboard

Thanks for adding `tf.` to `tf.nightly-se`

Adding 2.12.1 tests to the dashboard

Hi @hyeygit, Chandra is updating `nightly-se` to `tf.nightly-se` for consistency. Could you send a separate PR to address that in your tests? Otherwise, SE tests won't be shown on...

Adding Mixtral-8x22b

Are we good to start review? If so, please mark it as ready, and assign it to @RissyRan @gobbleturk and @ZhiyuLi-goog. Thanks!

Flash attention - head_dim 64

Hi, thanks for reaching out! Could you provide more detailed logs for the `not implemented` error with 64 dims? Yeah, padding may be needed for 192 dims, depending on the hardware design....

Flash attention - head_dim 64

The recent change is merged; please give it a try: https://github.com/jax-ml/jax/pull/30862

MFU drops significantly when using megablox with more experts

Thanks for reaching out! It seems you have tuned this general tile size a bit ([here](https://github.com/AI-Hypercomputer/maxtext/blob/f69734088f4746a0507646be287f4f57e5e174d7/MaxText/layers/linears.py#L403)), but I'd like to mention this size could be very different based...

MFU drops significantly when using megablox with more experts

Thanks for the info! Yes, ideally, we should see pallas_call among the top operations. Our team is working on a DeepSeek-like model config and has onboarded some functional features recently. We are also...

MFU drops significantly when using megablox with more experts

Thanks for reaching out! We ran some internal benchmarks of DeepSeek v3 and Llama4 Maverick on [Cloud v5p](https://cloud.google.com/tpu/docs/v5p), using megablox, adamw, dtype=bf16, weight_dtype=f32, and FSDP sharding. The performance is around...