Updated Kaggle Gemma3_(4B) notebook
Set `tokenize = False` in `tokenizer.apply_chat_template`. The notebook won't run otherwise, since tokenization happens twice. I ran into this minor issue when running it today.
```python
def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples["conversations"], tokenize = False) # here
    return { "text" : texts }
pass
```
```python
tokenizer = get_chat_template(
    tokenizer,
    tokenize = False, # and here (inference and two other following cells)
    chat_template = "gemma-3",
)
```
It works fine in my case without the specification. Could you share the error, maybe?
Like I mentioned, it won't run since tokenization happens twice.
Without the specification, the error shows up later on:
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-ada16cbe0bb7> in <cell line: 2>()
      1 from trl import SFTTrainer, SFTConfig
----> 2 trainer = SFTTrainer(
      3     model = model,
      4     tokenizer = tokenizer,
      5     train_dataset = dataset,

/usr/local/lib/python3.10/dist-packages/unsloth/trainer.py in new_init(self, *args, **kwargs)
    201         kwargs["args"] = config
    202     pass
--> 203     original_init(self, *args, **kwargs)
    204 pass
    205 return new_init

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func, **kwargs)
   1008         fix_zero_training_loss(model, tokenizer, train_dataset)
   1009
-> 1010         super().__init__(
   1011             model = model,
   1012             args = args,

/usr/local/lib/python3.10/dist-packages/transformers/utils/deprecation.py in wrapped_func(*args, **kwargs)
    170             warnings.warn(message, FutureWarning, stacklevel=2)
    171
--> 172         return func(*args, **kwargs)
    173
    174     return wrapped_func

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizers, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func)
    458         preprocess_dataset = args.dataset_kwargs is None or not args.dataset_kwargs.get("skip_prepare_dataset", False)
    459         if preprocess_dataset:
--> 460             train_dataset = self._prepare_dataset(
    461                 train_dataset, processing_class, args, args.packing, formatting_func, "train"
    462             )

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in _prepare_dataset(self, dataset, processing_class, args, packing, formatting_func, dataset_name)
    704
    705         if bos_token is not None:
--> 706             if test_text.startswith(bos_token) or bos_token in chat_template:
    707                 add_special_tokens = False
    708                 print("Unsloth: We found double BOS tokens - we shall remove one automatically.")

AttributeError: 'int' object has no attribute 'startswith'
```
With the specification, it runs without the error.
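For what it's worth, here is a minimal sketch of what's going on (hypothetical snippet, not the notebook code; the checkpoint name is an assumption):

```python
# Sketch: why the default tokenize=True trips SFTTrainer's BOS check above.
# With tokenize=True, apply_chat_template returns token IDs (ints), so the
# "text" column ends up holding ints instead of strings, and
# test_text.startswith(bos_token) in _prepare_dataset raises AttributeError.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/gemma-3-4b-it")  # assumed checkpoint
messages = [{"role": "user", "content": "Hi!"}]

as_ids  = tok.apply_chat_template(messages)                  # default tokenize=True -> list[int]
as_text = tok.apply_chat_template(messages, tokenize=False)  # -> str

print(type(as_ids), type(as_ids[0]))  # list of ints
print(type(as_text))                  # str

# The BOS check only works on the string form:
print(as_text.startswith(tok.bos_token or ""))
# as_ids[0].startswith(...)  # AttributeError: 'int' object has no attribute 'startswith'
```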
@Erland366 @naimur-29 he is right that tokenize=False should be used here. That's the recommended practice to avoid duplication. Check here and here. Might explain some of the errors people are getting.
It also depends on what you're passing to the Trainer. @naimur-29 do you have a readable copy of the ipynb notebook?
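For reference, here is roughly how the mapping looks when you pass plain strings to the Trainer (a sketch; `tokenizer` and `dataset` come from the notebook context, and `formatting_prompts_func` is a hypothetical name):

```python
# Sketch: map each conversation to a string once, then let SFTTrainer
# handle tokenization. Assumes a "conversations" column holding
# chat-format message lists, as in the notebook.
def formatting_prompts_func(examples):
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False)  # strings, not token IDs
        for convo in examples["conversations"]
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
print(type(dataset[0]["text"]))  # should be <class 'str'>, not a list of ints
```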
Just double-checked this. That's not the issue: in Unsloth's custom apply_chat_template method, tokenize = False is already the default value, so there is no need to set it explicitly.
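A quick way to check this inside the notebook (a sketch; whether the patched method exposes its default in the signature is an assumption on my part):

```python
# Sketch: inspect the default of the (possibly patched) apply_chat_template
# after applying Unsloth's chat template. Run in the notebook, where
# `tokenizer` is already loaded.
import inspect
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="gemma-3")
# If the wrapper defaults to tokenize=False, it should show up here:
print(inspect.signature(tokenizer.apply_chat_template))
```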
@shimmyshimmer this PR can be closed, as it is now superseded.