Failing to test "IntelTensorFlow_for_LLMs" sample in CI
Summary
The "IntelTensorFlow_for_LLMs" sample takes ~5 hours to run, so it times out in CI.
Environment
OS: Linux
Observed behavior
The sample shows how to fine-tune a 6B-parameter model, which takes ~5 hours on CPU. This makes it hard to test the sample in CI.
Expected behavior
Ideally, the sample should take no more than a few minutes to run. We could use an environment variable to detect whether the sample is running in CI and, if so, run it for only a few batches. That would be enough to verify the correctness of the code sample.
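A minimal sketch of the idea, assuming the CI system sets the conventional CI environment variable (GitHub Actions and most hosted CI systems set CI=true; the function name and step counts here are illustrative, not from the sample):

```python
import os

def max_train_steps(default_steps: int = -1, ci_steps: int = 10) -> int:
    """Return a reduced step count when running under CI.

    -1 means "no limit" (train on the full dataset); in CI we cap
    training at a handful of batches, which is enough to exercise
    the code path end to end.
    """
    if os.environ.get("CI", "").lower() == "true":
        return ci_steps
    return default_steps
```

The returned value could then be passed to something like Keras' `Model.fit(..., steps_per_epoch=...)` so a CI run finishes in minutes while local runs are unaffected.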
Hi @Ankur-singh, I'm the team lead for oneAPI_CS_Team4. My team would like to look into this issue for the hackathon.
Per my understanding, increasing the train/eval batch size results in faster training at the cost of some accuracy, which is a reasonable tradeoff for a CI environment.
I tested increasing the batch size from 64 to 256 in the TrainingArgs class in GPTJ_finetuning.py:
self.per_device_train_batch_size=256
self.per_device_eval_batch_size=256
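Instead of hard-coding 256, the batch sizes could be made overridable so CI can change them without editing the file. A hedged sketch, assuming a TrainingArgs class like the one in GPTJ_finetuning.py (the TRAIN_BATCH_SIZE / EVAL_BATCH_SIZE variable names are illustrative, not an existing convention in the sample):

```python
import os

class TrainingArgs:
    """Illustrative stand-in for the sample's TrainingArgs class.

    Batch sizes default to the sample's original value of 64 but can
    be raised (or lowered) via environment variables in a CI job.
    """
    def __init__(self):
        self.per_device_train_batch_size = int(
            os.environ.get("TRAIN_BATCH_SIZE", "64"))
        self.per_device_eval_batch_size = int(
            os.environ.get("EVAL_BATCH_SIZE", "64"))
```

A CI job could then export TRAIN_BATCH_SIZE=256 before launching the script, while local runs keep the documented defaults.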
Here's a sample output:
2025-03-05 09:42:55.939159: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.
5/33 [===>..........................] - ETA: 42:18 - loss: 1.2633 - accuracy: 0.600
Maybe you could test this in your CI environment and see how long it takes?
Also, may I know which environment variable to check to detect whether the sample is running in CI?
Thanks!
@fongjiantan All TensorFlow samples have been moved to a separate repo, so IMO this is not a priority. You should check with @jimmytwei.