Add more tests for Mixtral

Open RissyRan opened this issue 1 year ago • 0 comments

Description

Add two more Mixtral tests, as requested in the PR (along with this PR):

  • to generate unscanned ckpt
  • to run pre-training

Note: I was not able to get decoding with an unscanned ckpt to work on multihost. I hit a few interesting errors, shown below. I asked in the group chat, and no one has a clue yet.

So I added a TODO in the test script for now and will take a deeper look.

Error from try #1:
2024-04-30 03:13:45.121096: I external/xla/xla/pjrt/distributed/client.cc:134] Distributed task shutdown initiated.
2024-04-30 03:13:45.122686: I external/xla/xla/pjrt/distributed/client.cc:136] Distributed task shutdown result: OK
2024-04-30 03:13:45.122715: I external/tsl/tsl/distributed_runtime/preemption/preemption_sync_manager.cc:168] Cancelled call to retrieve preemption notice. This is expected upon program shutdown.

Error from try #2:
[2024-04-28, 05:35:50 UTC] {xpk.py:157} INFO - W0000 00:00:1714282001.278761    9853 curl_transport.cc:394] Error [56]=Failure when receiving data from the peer in curl operation

Test

Uploaded to Airflow, and the tests pass - link

RissyRan, Apr 30 '24 03:04