URGENT: Fine-tuning notebook gives runtime error after certain steps
I am using the model
urchade/gliner_multi_pii-v1
Ok, this is due to a problem in the data loader.
You can add an exception handler in the training loop to avoid stopping the training.
I have added the try block below to catch the exception.
Now when I run it for 10,000 steps (the default configuration in the finetune.ipynb notebook), it ends execution as below.
@urchade
Anything wrong?
Please advise.
Any updates on this? When I add the try/except block, it fails on every iteration.
Hi, I will try to investigate the problem in detail, sorry for the delay @micaelakaplan @vatsaldin
Check whether the format of your dataset is correct. If it is, modify the function like this and try again:
import os

import torch
from tqdm import tqdm
from transformers import get_cosine_schedule_with_warmup


def train(model, config, train_data, eval_data=None):
    model = model.to(config.device)

    # Set the sampling parameters used when building training batches
    model.set_sampling_params(
        max_types=config.max_types,
        shuffle_types=config.shuffle_types,
        random_drop=config.random_drop,
        max_neg_type_ratio=config.max_neg_type_ratio,
        max_len=config.max_len,
    )
    model.train()

    train_loader = model.create_dataloader(train_data, batch_size=config.train_batch_size, shuffle=True)
    optimizer = model.get_optimizer(config.lr_encoder, config.lr_others, config.freeze_token_rep)

    pbar = tqdm(range(config.num_steps))

    # warmup_ratio < 1 is read as a fraction of num_steps,
    # otherwise as an absolute number of warmup steps
    num_warmup_steps = int(config.num_steps * config.warmup_ratio) if config.warmup_ratio < 1 else int(config.warmup_ratio)
    scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, config.num_steps)

    iter_train_loader = iter(train_loader)

    for step in pbar:
        # Restart the dataloader when it is exhausted
        try:
            x = next(iter_train_loader)
        except StopIteration:
            iter_train_loader = iter(train_loader)
            x = next(iter_train_loader)

        # Move all tensors in the batch to the target device
        for k, v in x.items():
            if isinstance(v, torch.Tensor):
                x[k] = v.to(config.device)

        # Skip batches that crash the forward pass instead of stopping training
        try:
            loss = model(x)  # forward pass
        except RuntimeError as e:
            print(f"Error during forward pass at step {step}: {e}")
            print(f"x: {x}")
            continue

        if torch.isnan(loss):
            print("Loss is NaN, skipping...")
            continue

        loss.backward()        # compute gradients
        optimizer.step()       # update parameters
        scheduler.step()       # update learning rate schedule
        optimizer.zero_grad()  # reset gradients

        description = f"step: {step} | epoch: {step // len(train_loader)} | loss: {loss.item():.2f}"
        pbar.set_description(description)

        # Periodically evaluate and save a checkpoint
        if (step + 1) % config.eval_every == 0:
            model.eval()
            if eval_data:
                results, f1 = model.evaluate(
                    eval_data["samples"],
                    flat_ner=True,
                    threshold=0.5,
                    batch_size=12,
                    entity_types=eval_data["entity_types"],
                )
                print(f"Step={step}\n{results}")
            if not os.path.exists(config.save_directory):
                os.makedirs(config.save_directory)
            model.save_pretrained(f"{config.save_directory}/finetuned_{step}")
            model.train()
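Regarding the format check, a quick sanity pass over the data could look like this (a minimal sketch; it assumes the sample layout shown later in this thread, i.e. a dict with a 'tokenized_text' token list and an 'ner' list of [start, end, type] spans):

# Minimal sketch of a dataset format check before training.
def check_format(train_data):
    for i, sample in enumerate(train_data):
        assert isinstance(sample.get("tokenized_text"), list), f"sample {i}: bad tokenized_text"
        assert isinstance(sample.get("ner"), list), f"sample {i}: bad ner"
        for span in sample["ner"]:
            start, end, ent_type = span
            assert 0 <= start <= end < len(sample["tokenized_text"]), f"sample {i}: span out of range"
            assert isinstance(ent_type, str), f"sample {i}: entity type must be a string"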
For me, this error is caused by fine-tuning examples that lack any labels.
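If that is the cause in your case too, one workaround (a minimal sketch, assuming the sample dict format used elsewhere in this thread) is to filter those examples out before building the dataloader:

# Sketch: drop samples whose 'ner' list is empty before training,
# since those are the examples that trigger the crash described here.
train_data = [sample for sample in train_data if sample.get("ner")]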
I could reproduce the problem when fine-tuning with the conll2003 dataset. The data point causing the crash is:
{'tokenized_text': ['Reiterates', 'previous', '"', 'buy', '"', 'recommendation', 'after', 'results', '.'], 'ner': []}
The crash happens at this line:
logits_label = scores.view(-1, num_classes)
The values at that point are:
scores: tensor([], device='cuda:0', size=(4, 9, 12, 0), grad_fn=<ViewBackward0>)
num_classes: 0
entity_type_mask: tensor([], device='cuda:0', size=(4, 0), dtype=torch.bool)
In this example, the ground-truth ner list is empty. But scores is calculated from the input data, so shouldn't this case be handled in the code?
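For what it's worth, the failure can be reproduced in isolation (a minimal sketch, independent of the GLiNER code):

import torch

# Minimal reproduction: same shape as in the crash log above. The last
# dimension (number of entity types) is 0, so view(-1, 0) cannot infer
# the size of the -1 dimension.
scores = torch.empty(4, 9, 12, 0)
num_classes = scores.shape[-1]  # 0
try:
    logits_label = scores.view(-1, num_classes)
except RuntimeError as e:
    print(e)  # cannot reshape tensor of 0 elements into shape [-1, 0] ...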
@urchade, can you please comment? Section 4.3 of the paper also talks about in-domain supervised training on CoNLL03; how come this issue was not encountered there?
For the CoNLL 03 dataset, you can fix the labels by setting a label key for each sample; that's how I did it for supervised fine-tuning:
{'tokenized_text': ['Reiterates', 'previous', '"', 'buy', '"', 'recommendation', 'after', 'results', '.'], 'label': ["person", "org", ...], 'ner': []}
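A loop like the following would add the key to every sample (a sketch; the spelled-out type names are my assumption for the "..." above, so substitute whatever label strings your setup actually uses):

# Sketch: attach a fixed 'label' key to every sample so that samples with
# an empty 'ner' list still define the entity-type set.
# NOTE: these type names are an assumption; use your own label strings.
conll_labels = ["person", "organization", "location", "miscellaneous"]
for sample in train_data:
    sample["label"] = conll_labels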
What can be done for a normal dataset? @urchade, or could you release a fix?
What do you mean? You just have to add a label key in the dictionary.
I mean the other datasets (non-CoNLL 03).
For fine-tuning?
If your labels are fixed, just add them to label. If not, negative entity types are sampled in batch.