Failing to test "IntelTensorFlow_for_LLMs" sample in CI
Summary
The "IntelTensorFlow_for_LLMs" sample takes ~5 hours to run, so it times out in CI.
Environment
OS: Linux
Observed behavior
The sample shows how to fine-tune a 6B-parameter model, which takes ~5 hours on CPU. This makes it hard to test the sample in CI.
Expected behavior
Ideally, the sample should take no more than a few minutes to run. We could use an environment variable to detect whether the sample is running in CI and, if so, run it for only a few batches. That would be enough to verify the correctness of the code sample.
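A minimal sketch of the idea, assuming the CI system sets the conventional CI environment variable (GitHub Actions and most hosted CI systems set CI=true; the function name and step counts here are illustrative, not from the sample):

```python
import os

def max_train_steps(default_steps: int = -1, ci_steps: int = 10) -> int:
    """Return a reduced step count when running under CI.

    -1 means "no limit" (train on the full dataset); in CI we cap
    training at a handful of batches, which is enough to exercise
    the code path end to end.
    """
    if os.environ.get("CI", "").lower() == "true":
        return ci_steps
    return default_steps
```

The returned value could then be passed to something like Keras' `Model.fit(..., steps_per_epoch=...)` so a CI run finishes in minutes while local runs are unaffected.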
Hi @Ankur-singh, I'm the team lead for oneAPI_CS_Team4. My team would like to look into this issue for the hackathon.
Per my understanding, increasing the train/eval batch size results in faster training at the cost of some accuracy, which is a reasonable tradeoff for a CI environment.
I tested increasing the batch size from 64 to 256 in the TrainingArgs class in GPTJ_finetuning.py:
self.per_device_train_batch_size=256
self.per_device_eval_batch_size=256
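Instead of hard-coding 256, the batch sizes could be made overridable so CI can change them without editing the file. A hedged sketch, assuming a TrainingArgs class like the one in GPTJ_finetuning.py (the TRAIN_BATCH_SIZE / EVAL_BATCH_SIZE variable names are illustrative, not an existing convention in the sample):

```python
import os

class TrainingArgs:
    """Illustrative stand-in for the sample's TrainingArgs class.

    Batch sizes default to the sample's original value of 64 but can
    be raised (or lowered) via environment variables in a CI job.
    """
    def __init__(self):
        self.per_device_train_batch_size = int(
            os.environ.get("TRAIN_BATCH_SIZE", "64"))
        self.per_device_eval_batch_size = int(
            os.environ.get("EVAL_BATCH_SIZE", "64"))
```

A CI job could then export TRAIN_BATCH_SIZE=256 before launching the script, while local runs keep the documented defaults.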
Here's a sample output:
2025-03-05 09:42:55.939159: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.
5/33 [===>..........................] - ETA: 42:18 - loss: 1.2633 - accuracy: 0.600
Maybe you could test this in your CI environment and see how long it takes?
Also, may I know which environment variable to check to detect whether the sample is running in CI?
Thanks!
@fongjiantan All TensorFlow samples have been moved to a separate repo, so IMO this is not a priority. You should check with @jimmytwei.