
updated kaggle-Gemma3_(4B) notebook

naimur-29 opened this issue · 5 comments

Set tokenize = False in tokenizer.apply_chat_template; it won't run otherwise, since the tokenizing happens twice. I hit this minor issue when running the notebook today.

def apply_chat_template(examples):
    # tokenize = False makes this return formatted strings, not token IDs
    texts = tokenizer.apply_chat_template(examples["conversations"], tokenize = False) # here
    return { "text" : texts }
pass

tokenizer = get_chat_template(
    tokenizer,
    tokenize = False, # and here (inference and two other following cells)
    chat_template = "gemma-3",
)
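
For reference, a minimal sketch of the next step, reusing the tokenizer and the apply_chat_template function above (the two-turn toy dataset is made up purely for illustration):

from datasets import Dataset

# Toy dataset with a "conversations" column; the real notebook loads its own data.
dataset = Dataset.from_dict({
    "conversations": [
        [{"role": "user", "content": "Hello"},
         {"role": "assistant", "content": "Hi there!"}],
    ]
})

# With tokenize = False the map fills "text" with strings, which is what
# SFTTrainer's text pipeline expects; with tokenize = True it would hold token IDs.
dataset = dataset.map(apply_chat_template, batched = True)
print(type(dataset[0]["text"]))  # <class 'str'>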

naimur-29 commented on Mar 22 '25

It works fine in my case without the specification. Can you give the error maybe?

Erland366 commented on Mar 27 '25

> It works fine in my case without the specification. Can you give the error maybe?

Like I mentioned, it won't run since the tokenizing is happening twice.

Without specification:

[screenshot: the apply_chat_template cell run without tokenize = False]

Error later on:

[screenshot of the traceback, reproduced as text below]

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-ada16cbe0bb7> in <cell line: 2>()
      1 from trl import SFTTrainer, SFTConfig
----> 2 trainer = SFTTrainer(
      3     model = model,
      4     tokenizer = tokenizer,
      5     train_dataset = dataset,

/usr/local/lib/python3.10/dist-packages/unsloth/trainer.py in new_init(self, *args, **kwargs)
    201             kwargs["args"] = config
    202         pass
--> 203         original_init(self, *args, **kwargs)
    204     pass
    205     return new_init

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func, **kwargs)
   1008         fix_zero_training_loss(model, tokenizer, train_dataset)
   1009 
-> 1010         super().__init__(
   1011             model = model,
   1012             args = args,

/usr/local/lib/python3.10/dist-packages/transformers/utils/deprecation.py in wrapped_func(*args, **kwargs)
    170                 warnings.warn(message, FutureWarning, stacklevel=2)
    171 
--> 172             return func(*args, **kwargs)
    173 
    174         return wrapped_func

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizers, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func)
    458         preprocess_dataset = args.dataset_kwargs is None or not args.dataset_kwargs.get("skip_prepare_dataset", False)
    459         if preprocess_dataset:
--> 460             train_dataset = self._prepare_dataset(
    461                 train_dataset, processing_class, args, args.packing, formatting_func, "train"
    462             )

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in _prepare_dataset(self, dataset, processing_class, args, packing, formatting_func, dataset_name)
    704 
    705             if bos_token is not None:
--> 706                 if test_text.startswith(bos_token) or bos_token in chat_template:
    707                     add_special_tokens = False
    708                     print("Unsloth: We found double BOS tokens - we shall remove one automatically.")

AttributeError: 'int' object has no attribute 'startswith'
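
The AttributeError lines up with the "text" column holding token IDs instead of strings: with tokenize = True (the transformers default), apply_chat_template returns lists of ints, and _prepare_dataset's test_text.startswith(bos_token) then runs on an int. A minimal sketch of that failure mode (the one-line chat template is made up so the example runs with any tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# Hypothetical trivial template, just so apply_chat_template works here.
tok.chat_template = "{% for m in messages %}{{ m['content'] }}{% endfor %}"

conv = [{"role": "user", "content": "hi"}]
ids  = tok.apply_chat_template(conv, tokenize = True)   # list of ints
text = tok.apply_chat_template(conv, tokenize = False)  # plain str
print(type(ids[0]), type(text))  # <class 'int'> <class 'str'>

ids[0].startswith("<bos>")  # AttributeError: 'int' object has no attribute 'startswith'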

With specification:

[screenshot: the same cells with tokenize = False set]

naimur-29 commented on Mar 27 '25

@Erland366 @naimur-29 he is right that tokenize = False should be used here. That's the recommended practice to avoid duplication. Check here and here. Might explain some of the errors people are getting.

It also depends on what you're passing to the Trainer. @naimur-29 do you have a readable copy of the ipynb notebook?
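
For context, a minimal sketch of what the Trainer call typically looks like in this setup, assuming the dataset's "text" column already holds formatted strings (the config values are illustrative, not the notebook's actual settings):

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,                # the Unsloth-loaded model from earlier cells
    tokenizer = tokenizer,
    train_dataset = dataset,      # needs a "text" column of strings
    args = SFTConfig(
        dataset_text_field = "text",      # which column the trainer reads
        max_seq_length = 2048,            # illustrative value
        per_device_train_batch_size = 2,  # illustrative value
        output_dir = "outputs",
    ),
)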

rolandtannous commented on Apr 20 '25

> @Erland366 @naimur-29 he is right that tokenize = False should be used here. That's the recommended practice to avoid duplication. Check here and here. Might explain some of the errors people are getting.
>
> It also depends on what you're passing to the Trainer. @naimur-29 do you have a readable copy of the ipynb notebook?

Just double-checked this. That's not the issue: in Unsloth's custom apply_chat_template method, tokenize = False is the default value, so there's no need to set it explicitly.
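
If in doubt, a quick way to check the default on the patched tokenizer (assuming the tokenizer returned by get_chat_template is in scope):

import inspect

# Read off the default of the tokenize parameter; per the comment above,
# it should come back as False for Unsloth's patched apply_chat_template.
sig = inspect.signature(tokenizer.apply_chat_template)
print(sig.parameters["tokenize"].default)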

rolandtannous commented on Apr 21 '25

@shimmyshimmer this PR can be closed as it is now superseded.

rolandtannous commented on May 26 '25