training-operator

Update trainer to ensure type consistency for `train_args` and `lora_config`

Open helenxie-bit opened this issue 1 year ago • 4 comments

What this PR does / why we need it: Add data preprocessing for `train_args` and `lora_config` to ensure each parameter's type is consistent with the type of its reference value. This is necessary for developing the Katib tune API to optimize hyperparameters.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Fixes #

Checklist:

helenxie-bit avatar Jul 22 '24 21:07 helenxie-bit

Detailed reason for this change: We aim to reuse this trainer for the Katib LLM Hyperparameter Optimization API. Katib's controller substitutes hyperparameters with different values for each trial, and these values default to strings. This type inconsistency causes errors when running the trainer. Therefore, it is necessary to preprocess train_args and lora_config to ensure type consistency.
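To make the idea concrete, here is a minimal sketch of the kind of preprocessing described above — casting each string value to the type of the corresponding reference value. The helper name `preprocess_args` and the sample values are illustrative, not the actual code in this PR:

```python
import json

def preprocess_args(parsed: dict, reference: dict) -> dict:
    """Cast string values in `parsed` to the type of the matching reference value."""
    out = {}
    for key, value in parsed.items():
        ref = reference.get(key)
        if isinstance(value, str) and ref is not None and not isinstance(ref, str):
            if isinstance(ref, bool):  # check bool before int: bool is an int subclass
                out[key] = value.lower() in ("1", "true", "yes")
            elif isinstance(ref, int):
                out[key] = int(float(value))
            elif isinstance(ref, float):
                out[key] = float(value)
            else:
                out[key] = json.loads(value)  # lists/dicts serialized as JSON
        else:
            out[key] = value
    return out

# Reference types here come from TrainingArguments defaults (values illustrative):
reference = {"learning_rate": 5e-05, "num_train_epochs": 3, "fp16": False}
raw = json.loads('{"learning_rate": "3.355e-05", "num_train_epochs": "2", "fp16": "true"}')
print(preprocess_args(raw, reference))
```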

Example: When optimizing the learning rate, users set the parameters:

learning_rate = katib.search.double(min=1e-05, max=5e-05),

Arguments passed to the training container become:

--training_parameters '{..., "learning_rate": "3.355107835249428e-05", ...}'

This leads to the following error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/Users/helen/Documents/05_GSoC/training-operator/sdk/python/kubeflow/trainer/hf_llm_training.py", line 196, in <module>
[rank0]:     train_model(model, transformer_type, train_data, eval_data, tokenizer, train_args)
[rank0]:   File "/Users/helen/Documents/05_GSoC/training-operator/sdk/python/kubeflow/trainer/hf_llm_training.py", line 147, in train_model
[rank0]:     trainer.train()
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1624, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1725, in _inner_training_loop
[rank0]:     self.create_optimizer_and_scheduler(num_training_steps=max_steps)
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 954, in create_optimizer_and_scheduler
[rank0]:     self.create_optimizer()
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1001, in create_optimizer
[rank0]:     self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/torch/optim/adamw.py", line 29, in __init__
[rank0]:     if not 0.0 <= lr:
[rank0]:            ^^^^^^^^^
[rank0]: TypeError: '<=' not supported between instances of 'float' and 'str'
E0722 14:52:04.854000 7957912640 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 54960) of binary: /opt/homebrew/anaconda3/envs/katib-llm-test/bin/python
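The root cause can be reproduced outside the trainer: the AdamW constructor validates `0.0 <= lr`, which raises exactly this `TypeError` when the learning rate arrives as a string. A minimal demonstration:

```python
# Value as substituted by the Katib controller (string, not float):
lr = "3.355107835249428e-05"

try:
    0.0 <= lr  # same check as torch/optim/adamw.py performs on `lr`
except TypeError as exc:
    print(type(exc).__name__)

# Casting the string first, as the proposed preprocessing does, fixes it:
print(0.0 <= float(lr))
```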

helenxie-bit avatar Jul 22 '24 22:07 helenxie-bit

Pull Request Test Coverage Report for Build 10049294187

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on helenxie/update-trainer at 35.406%

Totals Coverage Status
Change from base Build 9999203579: 35.4%
Covered Lines: 4378
Relevant Lines: 12365

💛 - Coveralls

coveralls avatar Jul 22 '24 22:07 coveralls

I built the image of this trainer on my local computer and tried to test my example for the Katib LLM Hyperparameter Optimization API, which uses this trainer. It kept showing the following two errors:

Error 1:

I0724 18:03:43.553156      83 main.go:143]   Traceback (most recent call last):
I0724 18:03:43.553167      83 main.go:143]   File "/app/hf_llm_training.py", line 169, in <module>
I0724 18:03:43.553219      83 main.go:143]     train_args = TrainingArguments(**json.loads(args.training_parameters))
I0724 18:03:43.553226      83 main.go:143]   File "<string>", line 123, in __init__
I0724 18:03:43.553332      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1528, in __post_init__
I0724 18:03:43.553968      83 main.go:143]     and (self.device.type != "cuda")
I0724 18:03:43.553974      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1995, in device
I0724 18:03:43.554210      83 main.go:143]     return self._setup_devices
I0724 18:03:43.554219      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 56, in __get__
I0724 18:03:43.554223      83 main.go:143]     cached = self.fget(obj)
I0724 18:03:43.554297      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1914, in _setup_devices
I0724 18:03:43.554645      83 main.go:143]     self.distributed_state = PartialState(cpu=True, backend=self.ddp_backend)
I0724 18:03:43.554655      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 275, in __init__
I0724 18:03:43.554693      83 main.go:143]     self.set_device()
I0724 18:03:43.554698      83 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 786, in set_device
I0724 18:03:43.554764      83 main.go:143]     device_module.set_device(self.device)
I0724 18:03:43.554769      83 main.go:143] AttributeError: module 'torch.cpu' has no attribute 'set_device'. Did you mean: '_device'?
I0724 18:03:48.279502      83 main.go:143] [2024-07-24 18:03:48,275] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 66) of binary: /usr/bin/python

Error 2:

I0724 21:07:37.439131      69 main.go:143]   Traceback (most recent call last):
I0724 21:07:37.439150      69 main.go:143]   File "/app/hf_llm_training.py", line 9, in <module>
I0724 21:07:37.439158      69 main.go:143]     from peft import LoraConfig, get_peft_model
I0724 21:07:37.439161      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/peft/__init__.py", line 22, in <module>
I0724 21:07:37.439166      69 main.go:143]     from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PEFT_TYPE_TO_CONFIG_MAPPING, get_peft_config, get_peft_model
I0724 21:07:37.439169      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/peft/mapping.py", line 16, in <module>
I0724 21:07:37.439248      69 main.go:143]     from .peft_model import (
I0724 21:07:37.439258      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 22, in <module>
I0724 21:07:37.439262      69 main.go:143]     from accelerate import dispatch_model, infer_auto_device_map
I0724 21:07:37.439263      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/__init__.py", line 16, in <module>
I0724 21:07:37.439268      69 main.go:143]     from .accelerator import Accelerator
I0724 21:07:37.439269      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 36, in <module>
I0724 21:07:37.439336      69 main.go:143]     from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
I0724 21:07:37.439342      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/checkpointing.py", line 24, in <module>
I0724 21:07:37.439346      69 main.go:143]     from .utils import (
I0724 21:07:37.439348      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/__init__.py", line 190, in <module>
I0724 21:07:37.439407      69 main.go:143]     from .bnb import has_4bit_bnb_layers, load_and_quantize_model
I0724 21:07:37.439412      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/bnb.py", line 29, in <module>
I0724 21:07:37.439432      69 main.go:143]     from ..big_modeling import dispatch_model, init_empty_weights
I0724 21:07:37.439437      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py", line 24, in <module>
I0724 21:07:37.439468      69 main.go:143]     from .hooks import (
I0724 21:07:37.439475      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 30, in <module>
I0724 21:07:37.439497      69 main.go:143]     from .utils.other import recursive_getattr
I0724 21:07:37.439509      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/other.py", line 36, in <module>
I0724 21:07:37.439540      69 main.go:143]     from .transformer_engine import convert_model
I0724 21:07:37.439545      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/transformer_engine.py", line 21, in <module>
I0724 21:07:37.439564      69 main.go:143]     import transformer_engine.pytorch as te
I0724 21:07:37.439568      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/__init__.py", line 6, in <module>
I0724 21:07:37.439572      69 main.go:143]     from .module import LayerNormLinear
I0724 21:07:37.439573      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/__init__.py", line 6, in <module>
I0724 21:07:37.439594      69 main.go:143]     from .layernorm_linear import LayerNormLinear
I0724 21:07:37.439598      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/layernorm_linear.py", line 15, in <module>
I0724 21:07:37.439616      69 main.go:143]     from .. import cpp_extensions as tex
I0724 21:07:37.439620      69 main.go:143]   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/cpp_extensions/__init__.py", line 6, in <module>
I0724 21:07:37.439639      69 main.go:143]     from transformer_engine_extensions import *
I0724 21:07:37.439640      69 main.go:143] ImportError: libc10_cuda.so: cannot open shared object file: No such file or directory
I0724 21:07:37.786588      69 main.go:143] E0724 21:07:37.786000 281473339512928 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 59) of binary: /usr/bin/python

I guessed the error came from the base image, so I updated the base image version in the Dockerfile to FROM nvcr.io/nvidia/pytorch:24.06-py3, and it works perfectly now.

I'm wondering if anyone else has run into the same problem.

helenxie-bit avatar Jul 25 '24 09:07 helenxie-bit

Instead of type-casting the params here in the training operator, shall we take a look at the Katib API and see why Katib is translating everything to strings, and fix the issue at that layer? Thoughts @andreyvelich @johnugeorge

nsingl00 avatar Aug 02 '24 19:08 nsingl00

Instead of type-casting the params here in the training operator, shall we take a look at the Katib API and see why Katib is translating everything to strings, and fix the issue at that layer?

I am not sure that would be possible, since users might use various parts of the Pod spec to pass the HPs. For example, they can use environment variables to pass the HPs to the container, and envVar supports only string values: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L2315C2-L2315C7
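As a small illustration of the string-only constraint: anything delivered through an environment variable reaches Python as text and must be cast by the consumer. The variable name below is hypothetical, for illustration only:

```python
import os

# Kubernetes envVar values are always strings, so a hyperparameter passed
# through an environment variable arrives in the container as text:
os.environ["LEARNING_RATE"] = "3e-05"  # hypothetical variable name

raw = os.environ["LEARNING_RATE"]
print(type(raw).__name__)  # the consumer sees a str and must cast explicitly
print(float(raw))
```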

We are doing substitution for the Trial template here: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/manifest/generator.go#L129-L135

As you can see we don't do any type check before substitution.

@tenzen-y @johnugeorge Do you have any suggestion on the above ?

andreyvelich avatar Aug 05 '24 18:08 andreyvelich


Ok, I see. It's hard to do with env variables.

nsingl00 avatar Aug 06 '24 00:08 nsingl00

/area gsoc

helenxie-bit avatar Aug 07 '24 09:08 helenxie-bit

Hi @helenxie-bit, how are you passing values to the training container in Katib? Is it possible to pass values the way they are passed in the training operator: `"--training_parameters", json.dumps(train_parameters.training_parameters.to_dict()),`?

deepanker13 avatar Aug 08 '24 11:08 deepanker13

Hi @helenxie-bit, how are you passing values to the training container in Katib? Is it possible to pass values the way they are passed in the training operator: `"--training_parameters", json.dumps(train_parameters.training_parameters.to_dict()),`?

@deepanker13 Yes, I implemented it exactly the same way: https://github.com/kubeflow/katib/blob/61dc8ca1d9e8bec88c3ebc210c0e9b6b587f563a/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py#L672. However, there is a difference between how Katib and the Training Operator handle the arguments due to Katib's hyperparameter substitution.

For example, when optimizing learning_rate, the user would set the parameters like this:

trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            ...
            learning_rate = katib.search.double(min=1e-05, max=5e-05),
            ...
        ),
       ...
    )

Katib applies hyperparameter substitution and uses json.dumps(train_parameters.training_parameters.to_dict()), resulting in:

..."learning_rate": "${trialParameters.learning_rate}", ...

The Katib controller then sets the value for each trial according to the suggestion, so the training container ultimately receives:

--training_parameters '{..., "learning_rate": "3.355107835249428e-05", ...}'

As you can see, the value of `learning_rate` is a string instead of a float, which is why we need to add data preprocessing inside the trainer.
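The two stages above can be sketched end to end. This is an illustrative stand-in for the SDK serialization and the controller's text substitution, not the actual Katib code:

```python
import json

# Stage 1 (SDK side): training parameters are serialized with the Katib
# trial-parameter placeholder still embedded in the JSON string:
training_parameters = {"learning_rate": "${trialParameters.learning_rate}"}
arg = json.dumps(training_parameters)

# Stage 2 (controller side): the placeholder is substituted as plain text,
# so the surrounding JSON quotes survive and the value stays a string:
substituted = arg.replace("${trialParameters.learning_rate}", "3.355107835249428e-05")
parsed = json.loads(substituted)
print(parsed["learning_rate"], type(parsed["learning_rate"]).__name__)
```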

helenxie-bit avatar Aug 08 '24 14:08 helenxie-bit

/lgtm

deepanker13 avatar Aug 12 '24 02:08 deepanker13

thanks @helenxie-bit

deepanker13 avatar Aug 12 '24 02:08 deepanker13

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich


google-oss-prow[bot] avatar Aug 12 '24 12:08 google-oss-prow[bot]