Update trainer to ensure type consistency for `train_args` and `lora_config`
What this PR does / why we need it:
Add data preprocessing for `train_args` and `lora_config` to ensure each parameter's type is consistent with its reference value. This is necessary for developing the Katib tune API to optimize hyperparameters.
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #
Detailed reason for this change:
We aim to reuse this trainer for the Katib LLM Hyperparameter Optimization API. Katib's controller substitutes hyperparameters with different values for each trial, and the substituted values are passed as strings. This type inconsistency causes errors when running the trainer. Therefore, it is necessary to preprocess `train_args` and `lora_config` to ensure type consistency.
Example: When optimizing the learning rate, users set the parameters:
learning_rate = katib.search.double(min=1e-05, max=5e-05),
Arguments passed to the training container become:
--training_parameters '{..., "learning_rate": "3.355107835249428e-05", ...}'
This leads to the following error:
[rank0]: Traceback (most recent call last):
[rank0]: File "/Users/helen/Documents/05_GSoC/training-operator/sdk/python/kubeflow/trainer/hf_llm_training.py", line 196, in <module>
[rank0]: train_model(model, transformer_type, train_data, eval_data, tokenizer, train_args)
[rank0]: File "/Users/helen/Documents/05_GSoC/training-operator/sdk/python/kubeflow/trainer/hf_llm_training.py", line 147, in train_model
[rank0]: trainer.train()
[rank0]: File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1624, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1725, in _inner_training_loop
[rank0]: self.create_optimizer_and_scheduler(num_training_steps=max_steps)
[rank0]: File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 954, in create_optimizer_and_scheduler
[rank0]: self.create_optimizer()
[rank0]: File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/transformers/trainer.py", line 1001, in create_optimizer
[rank0]: self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages/torch/optim/adamw.py", line 29, in __init__
[rank0]: if not 0.0 <= lr:
[rank0]: ^^^^^^^^^
[rank0]: TypeError: '<=' not supported between instances of 'float' and 'str'
E0722 14:52:04.854000 7957912640 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 54960) of binary: /opt/homebrew/anaconda3/envs/katib-llm-test/bin/python
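To make the idea concrete, here is a rough sketch of the kind of type normalization described above. The helper name normalize_types and the call sites are illustrative only, not the exact code added in this PR:

import json

from peft import LoraConfig
from transformers import TrainingArguments


def normalize_types(parsed: dict, reference) -> dict:
    # Cast each value to the type of the corresponding attribute on a
    # reference (default-constructed) config object, so that a substituted
    # hyperparameter such as "3.355107835249428e-05" becomes a float again.
    normalized = {}
    for key, value in parsed.items():
        ref_value = getattr(reference, key, None)
        if isinstance(value, str) and ref_value is not None and not isinstance(ref_value, str):
            if isinstance(ref_value, bool):
                value = value.lower() in ("true", "1")
            elif isinstance(ref_value, (int, float)):
                value = type(ref_value)(value)
        normalized[key] = value
    return normalized


# Hypothetical call sites in the trainer script ("args" is the argparse namespace):
reference_args = TrainingArguments(output_dir=".")  # defaults serve as the reference values
train_args = TrainingArguments(**normalize_types(json.loads(args.training_parameters), reference_args))
lora_config = LoraConfig(**normalize_types(json.loads(args.lora_config), LoraConfig()))

Fields whose reference value is None or a non-primitive (for example a list of target modules) are left untouched and pass through unchanged.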
Pull Request Test Coverage Report for Build 10049294187
Details
- 0 of 0 changed or added relevant lines in 0 files are covered.
- No unchanged relevant lines lost coverage.
- Overall first build on helenxie/update-trainer at 35.406%
| Totals | |
|---|---|
| Change from base Build 9999203579: | 35.4% |
| Covered Lines: | 4378 |
| Relevant Lines: | 12365 |
💛 - Coveralls
I built the image of this trainer on my local computer and tried to test my example for the Katib LLM Hyperparameter Optimization API, which uses this trainer. It kept showing the following two errors:
Error 1:
I0724 18:03:43.553156 83 main.go:143] Traceback (most recent call last):
I0724 18:03:43.553167 83 main.go:143] File "/app/hf_llm_training.py", line 169, in <module>
I0724 18:03:43.553219 83 main.go:143] train_args = TrainingArguments(**json.loads(args.training_parameters))
I0724 18:03:43.553226 83 main.go:143] File "<string>", line 123, in __init__
I0724 18:03:43.553332 83 main.go:143] File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1528, in __post_init__
I0724 18:03:43.553968 83 main.go:143] and (self.device.type != "cuda")
I0724 18:03:43.553974 83 main.go:143] File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1995, in device
I0724 18:03:43.554210 83 main.go:143] return self._setup_devices
I0724 18:03:43.554219 83 main.go:143] File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 56, in __get__
I0724 18:03:43.554223 83 main.go:143] cached = self.fget(obj)
I0724 18:03:43.554297 83 main.go:143] File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1914, in _setup_devices
I0724 18:03:43.554645 83 main.go:143] self.distributed_state = PartialState(cpu=True, backend=self.ddp_backend)
I0724 18:03:43.554655 83 main.go:143] File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 275, in __init__
I0724 18:03:43.554693 83 main.go:143] self.set_device()
I0724 18:03:43.554698 83 main.go:143] File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 786, in set_device
I0724 18:03:43.554764 83 main.go:143] device_module.set_device(self.device)
I0724 18:03:43.554769 83 main.go:143] AttributeError: module 'torch.cpu' has no attribute 'set_device'. Did you mean: '_device'?
I0724 18:03:48.279502 83 main.go:143] [2024-07-24 18:03:48,275] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 66) of binary: /usr/bin/python
Error 2:
I0724 21:07:37.439131 69 main.go:143] Traceback (most recent call last):
I0724 21:07:37.439150 69 main.go:143] File "/app/hf_llm_training.py", line 9, in <module>
I0724 21:07:37.439158 69 main.go:143] from peft import LoraConfig, get_peft_model
I0724 21:07:37.439161 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/peft/__init__.py", line 22, in <module>
I0724 21:07:37.439166 69 main.go:143] from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PEFT_TYPE_TO_CONFIG_MAPPING, get_peft_config, get_peft_model
I0724 21:07:37.439169 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/peft/mapping.py", line 16, in <module>
I0724 21:07:37.439248 69 main.go:143] from .peft_model import (
I0724 21:07:37.439258 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 22, in <module>
I0724 21:07:37.439262 69 main.go:143] from accelerate import dispatch_model, infer_auto_device_map
I0724 21:07:37.439263 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/accelerate/__init__.py", line 16, in <module>
I0724 21:07:37.439268 69 main.go:143] from .accelerator import Accelerator
I0724 21:07:37.439269 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 36, in <module>
I0724 21:07:37.439308 69 main.go:143]
I0724 21:07:37.439336 69 main.go:143] from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
I0724 21:07:37.439342 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/accelerate/checkpointing.py", line 24, in <module>
I0724 21:07:37.439346 69 main.go:143] from .utils import (
I0724 21:07:37.439348 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/__init__.py", line 190, in <module>
I0724 21:07:37.439407 69 main.go:143] from .bnb import has_4bit_bnb_layers, load_and_quantize_model
I0724 21:07:37.439412 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/bnb.py", line 29, in <module>
I0724 21:07:37.439432 69 main.go:143] from ..big_modeling import dispatch_model, init_empty_weights
I0724 21:07:37.439437 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py", line 24, in <module>
I0724 21:07:37.439468 69 main.go:143] from .hooks import (
I0724 21:07:37.439475 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 30, in <module>
I0724 21:07:37.439497 69 main.go:143] from .utils.other import recursive_getattr
I0724 21:07:37.439509 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/other.py", line 36, in <module>
I0724 21:07:37.439540 69 main.go:143] from .transformer_engine import convert_model
I0724 21:07:37.439545 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/transformer_engine.py", line 21, in <module>
I0724 21:07:37.439564 69 main.go:143] import transformer_engine.pytorch as te
I0724 21:07:37.439568 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/__init__.py", line 6, in <module>
I0724 21:07:37.439572 69 main.go:143] from .module import LayerNormLinear
I0724 21:07:37.439573 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/__init__.py", line 6, in <module>
I0724 21:07:37.439594 69 main.go:143] from .layernorm_linear import LayerNormLinear
I0724 21:07:37.439598 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/layernorm_linear.py", line 15, in <module>
I0724 21:07:37.439616 69 main.go:143] from .. import cpp_extensions as tex
I0724 21:07:37.439620 69 main.go:143] File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/cpp_extensions/__init__.py", line 6, in <module>
I0724 21:07:37.439639 69 main.go:143] from transformer_engine_extensions import *
I0724 21:07:37.439640 69 main.go:143] ImportError: libc10_cuda.so: cannot open shared object file: No such file or directory
I0724 21:07:37.786588 69 main.go:143] E0724 21:07:37.786000 281473339512928 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 59) of binary: /usr/bin/python
I guess the error came from the base image, so I updated the base image version in the Dockerfile to `FROM nvcr.io/nvidia/pytorch:24.06-py3`, and it works perfectly now.
I'm wondering if anyone else has run into the same problem.
Instead of type-casting the params here in the training operator, shall we take a look at the Katib API to see why Katib is translating everything to strings and fix the issue at that layer? Thoughts @andreyvelich @johnugeorge
Instead of type-casting the params here in the training operator, shall we take a look at the Katib API to see why Katib is translating everything to strings and fix the issue at that layer?
I am not sure if that would be possible, since users might use various parts of the Pod spec to pass the HPs.
For example, they can use environment variables to pass the HPs to the container, and envVar supports only string values: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L2315C2-L2315C7
We are doing the substitution for the Trial template here: https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/manifest/generator.go#L129-L135
As you can see, we don't do any type check before substitution.
@tenzen-y @johnugeorge Do you have any suggestions on the above?
OK, I see. It's hard to do with env variables.
/area gsoc
Hi @helenxie-bit
How are you passing values to the training container in Katib?
Is it possible to pass values the way they are being passed in the training operator?
"--training_parameters", json.dumps(train_parameters.training_parameters.to_dict()),
Hi @helenxie-bit How are you passing values to the training container in Katib? Is it possible to pass values the way they are being passed in the training operator?
"--training_parameters", json.dumps(train_parameters.training_parameters.to_dict()),
@deepanker13 Yes, I implemented it exactly the same way: https://github.com/kubeflow/katib/blob/61dc8ca1d9e8bec88c3ebc210c0e9b6b587f563a/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py#L672. However, there is a difference between how Katib and the Training Operator handle the arguments due to Katib's hyperparameter substitution.
For example, when optimizing learning_rate, the user would set the parameters like this:
trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        ...
        learning_rate=katib.search.double(min=1e-05, max=5e-05),
        ...
    ),
    ...
)
Katib applies hyperparameter substitution and uses json.dumps(train_parameters.training_parameters.to_dict()), resulting in:
..."learning_rate": "${trialParameters.learning_rate}", ...
The Katib controller then sets the value for each trial according to the suggestion, so the training container ultimately receives:
--training_parameters '{..., "learning_rate": "3.355107835249428e-05", ...}'
As you can see, the value of `learning_rate` is a string instead of a float, which is why we need to add data preprocessing inside the trainer.
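To illustrate the mismatch with a minimal, self-contained snippet (not code from the trainer itself):

import json

raw = json.loads('{"learning_rate": "3.355107835249428e-05"}')
print(type(raw["learning_rate"]))  # <class 'str'> -- AdamW's "0.0 <= lr" check then fails
raw["learning_rate"] = float(raw["learning_rate"])
print(type(raw["learning_rate"]))  # <class 'float'> -- consistent with the reference value

The preprocessing added here performs this kind of cast for every parameter, using the type of the reference value rather than hard-coding float.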
/lgtm
thanks @helenxie-bit
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~sdk/python/OWNERS~~ [andreyvelich]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment