maxtext
Add more tests for Mixtral
Description
Add two more Mixtral tests, as requested in the PR (along with this PR):
- to generate unscanned ckpt
- to run pre-training
Note: I was not able to get decoding with an unscanned ckpt to work on multihost. I hit the errors below. I asked in the group chat, and no one has a clue yet.
So for now I am adding a TODO in the test script, and I will take a deeper look later.
Error from try #1:
2024-04-30 03:13:45.121096: I external/xla/xla/pjrt/distributed/client.cc:134] Distributed task shutdown initiated.
2024-04-30 03:13:45.122686: I external/xla/xla/pjrt/distributed/client.cc:136] Distributed task shutdown result: OK
2024-04-30 03:13:45.122715: I external/tsl/tsl/distributed_runtime/preemption/preemption_sync_manager.cc:168] Cancelled call to retrieve preemption notice. This is expected upon program shutdown.
Error from try #2:
[2024-04-28, 05:35:50 UTC] {xpk.py:157} INFO - W0000 00:00:1714282001.278761 9853 curl_transport.cc:394] Error [56]=Failure when receiving data from the peer in curl operation
Test
Uploaded to Airflow, and the test passes - link