
URGENT: Fine-tuning notebook gives runtime error after certain steps

Open vatsaldin opened this issue 1 year ago • 12 comments

I am using the model

urchade/gliner_multi_pii-v1

GLiNER_fine_tuning_error

vatsaldin avatar Apr 29 '24 06:04 vatsaldin

Ok, this is due to a problem in the data loader.

You can add an exception handler in the training loop to avoid stopping the training.

urchade avatar Apr 29 '24 07:04 urchade

I have added the try block as below to catch the exception.

image

Now when I run it for 10,000 steps (the default configuration in the finetune.ipynb notebook), it ends execution as below:

image

@urchade

Is anything wrong?

Please advise.

vatsaldin avatar Apr 29 '24 12:04 vatsaldin

Any updates on this? When I add the try/except block, it fails on every iteration.

micaelakaplan avatar May 01 '24 19:05 micaelakaplan

Hi, I will try to investigate the problem in detail, sorry for the delay @micaelakaplan @vatsaldin

urchade avatar May 01 '24 20:05 urchade

Check whether the format of your dataset is correct; if it is, modify the function like this and try again:

import os

import torch
from tqdm import tqdm
from transformers import get_cosine_schedule_with_warmup


def train(model, config, train_data, eval_data=None):
    model = model.to(config.device)
    model.set_sampling_params(
        max_types=config.max_types,
        shuffle_types=config.shuffle_types,
        random_drop=config.random_drop,
        max_neg_type_ratio=config.max_neg_type_ratio,
        max_len=config.max_len,
    )

    model.train()
    train_loader = model.create_dataloader(train_data, batch_size=config.train_batch_size, shuffle=True)
    optimizer = model.get_optimizer(config.lr_encoder, config.lr_others, config.freeze_token_rep)
    pbar = tqdm(range(config.num_steps))
    num_warmup_steps = int(config.num_steps * config.warmup_ratio) if config.warmup_ratio < 1 else int(config.warmup_ratio)
    scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, config.num_steps)
    iter_train_loader = iter(train_loader)

    for step in pbar:
        # Restart the data loader when it is exhausted
        try:
            x = next(iter_train_loader)
        except StopIteration:
            iter_train_loader = iter(train_loader)
            x = next(iter_train_loader)

        # Move the batch tensors to the target device
        for k, v in x.items():
            if isinstance(v, torch.Tensor):
                x[k] = v.to(config.device)

        # Skip batches that crash the forward pass instead of aborting the run
        try:
            loss = model(x)  # Forward pass
        except RuntimeError as e:
            print(f"Error during forward pass at step {step}: {e}")
            print(f"x: {x}")
            continue

        if torch.isnan(loss):
            print("Loss is NaN, skipping...")
            continue

        loss.backward()        # Compute gradients
        optimizer.step()       # Update parameters
        scheduler.step()       # Update learning rate schedule
        optimizer.zero_grad()  # Reset gradients

        description = f"step: {step} | epoch: {step // len(train_loader)} | loss: {loss.item():.2f}"
        pbar.set_description(description)

        if (step + 1) % config.eval_every == 0:
            model.eval()
            if eval_data:
                results, f1 = model.evaluate(
                    eval_data["samples"],
                    flat_ner=True,
                    threshold=0.5,
                    batch_size=12,
                    entity_types=eval_data["entity_types"],
                )
                print(f"Step={step}\n{results}")

            if not os.path.exists(config.save_directory):
                os.makedirs(config.save_directory)

            model.save_pretrained(f"{config.save_directory}/finetuned_{step}")
            model.train()
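
For completeness, here is a minimal sketch of how the function can be invoked. The config fields match what the function above reads, but the concrete values are placeholders (only num_steps=10000 corresponds to the notebook default mentioned earlier), and the notebook may build its config differently:

from types import SimpleNamespace

config = SimpleNamespace(
    device="cuda",
    max_types=25, shuffle_types=True, random_drop=True, max_neg_type_ratio=1, max_len=384,
    train_batch_size=8, lr_encoder=1e-5, lr_others=5e-5, freeze_token_rep=False,
    num_steps=10_000, warmup_ratio=0.1,
    eval_every=1_000, save_directory="finetuned_models",
)

train(model, config, train_data, eval_data=None)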

giusedm avatar May 02 '24 23:05 giusedm

For me, this error is caused by finetuning examples that lack any labels.
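
A quick workaround (just a sketch, assuming the usual GLiNER list-of-dicts training format with an 'ner' key, as in the example further down) is to drop those examples before training:

# Keep only samples that contain at least one annotated entity
train_data = [sample for sample in train_data if sample.get("ner")]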

peter-axion avatar May 07 '24 16:05 peter-axion

I could reproduce the problem when fine-tuning with the conll2003 dataset. The data point causing the crash is:

{'tokenized_text': ['Reiterates', 'previous', '"', 'buy', '"', 'recommendation', 'after', 'results', '.'], 'ner': []}

The crash happens at this line:

logits_label = scores.view(-1, num_classes)

The values at that point are:

scores: tensor([], device='cuda:0', size=(4, 9, 12, 0), grad_fn=<ViewBackward0>)
num_classes: 0
entity_type_mask: tensor([], device='cuda:0', size=(4, 0), dtype=torch.bool)

In this example, the ground-truth ner list is empty, but scores is calculated from the input data, so shouldn't this case be handled in the code?

@urchade, can you please comment? Section 4.3 of the paper also discusses in-domain supervised training on CoNLL03; how was this issue not encountered there?
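
As far as I can tell, the shape error can be reproduced in isolation with the values above (a minimal sketch outside of GLiNER):

import torch

# Same shapes as reported above: the entity-type dimension is 0
scores = torch.empty(4, 9, 12, 0)
num_classes = 0

try:
    logits_label = scores.view(-1, num_classes)
except RuntimeError as e:
    # view cannot infer the -1 dimension when the other dimension is 0
    print(f"RuntimeError: {e}")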

mkmohangb avatar May 16 '24 10:05 mkmohangb

For the CoNLL03 dataset, you can fix this by setting a label key for each sample; that's how I did it for supervised fine-tuning.

{'tokenized_text': ['Reiterates', 'previous', '"', 'buy', '"', 'recommendation', 'after', 'results', '.'], 'label': ["person", "org", ...], 'ner': []}
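
In code, that can look like the sketch below (the label strings are placeholders; use whichever entity type names your fine-tuning run targets):

conll_labels = ["person", "organization", "location", "miscellaneous"]

for sample in train_data:
    # every sample carries the full fixed label set, even when 'ner' is empty
    sample["label"] = conll_labels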

urchade avatar May 16 '24 11:05 urchade

What can be done for a normal dataset, @urchade? Or could you release a fix?

vatsaldin avatar May 16 '24 12:05 vatsaldin

What can be done for a normal dataset, @urchade? Or could you release a fix?

What do you mean? You just have to add a label key to the dictionary.

urchade avatar May 16 '24 13:05 urchade

I mean the other datasets (non-CoNLL03).

vatsaldin avatar May 16 '24 13:05 vatsaldin

For fine-tuning?

If your labels are fixed, just add them under label. If not, negative entity types are sampled in the batch.

urchade avatar May 16 '24 13:05 urchade