
During finetuning, `classes_to_id` is not correct

Open Ahmedn1 opened this issue 1 year ago • 6 comments

https://github.com/urchade/GLiNER/blob/e15c22a01b1a018674f725428ba1325c723df307/gliner/model.py#L100

This line will error out with a `KeyError` because it uses numeric indices to look up keys that can be strings. This happens when `entity_types` is provided to `create_dataloader`.
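A minimal sketch of the failure mode (the data here is hypothetical): `classes_to_id` maps entity-type strings to ids, so indexing it with a batch position rather than a type name raises a `KeyError`.

```python
# Hypothetical mapping of the kind GLiNER builds when entity_types is provided:
# entity-type strings -> integer ids.
classes_to_id = {"person": 1, "location": 2, "organization": 3}

try:
    classes_to_id[0]  # numeric index used as a key -> KeyError
except KeyError as e:
    print("KeyError:", e)
```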

Ahmedn1 avatar Mar 20 '24 04:03 Ahmedn1

I am not sure I understand.

Is it for training or inference?

urchade avatar Mar 20 '24 05:03 urchade

@urchade training. Following this notebook with the only difference of providing entity_types to training script. Like so:

train_loader = model.create_dataloader(train_data, batch_size=config.train_batch_size, shuffle=True, entity_types=config.entity_types)

Ahmedn1 avatar Mar 20 '24 06:03 Ahmedn1

Is `entity_types` in the correct format? It should be a list of strings.

Actually, I do not suggest setting entity types during training; leaving them unset gives better generalization.

urchade avatar Mar 20 '24 06:03 urchade

@urchade yes it is a list of strings

Ahmedn1 avatar Mar 20 '24 19:03 Ahmedn1

I have the same issue when specifying `entity_types` in `create_dataloader`.

The issue is here:

When `entity_types` is set to `None`, `classes_to_id` is indeed a LIST of dictionaries (one for each sentence in the batch): https://github.com/urchade/GLiNER/blob/e15c22a01b1a018674f725428ba1325c723df307/gliner/modules/base.py#L62

But when `entity_types` is provided, it is a single dictionary: https://github.com/urchade/GLiNER/blob/e15c22a01b1a018674f725428ba1325c723df307/gliner/modules/base.py#L113

So a quick fix, when using `entity_types`, is to handle both cases. Remove this line: https://github.com/urchade/GLiNER/blob/e15c22a01b1a018674f725428ba1325c723df307/gliner/model.py#L100

and instead add:

# classes_to_id is a list of per-sentence dicts when entity_types is None,
# and a single shared dict when entity_types is provided
if isinstance(x["classes_to_id"], list):
    all_types_i = list(x["classes_to_id"][i].keys())
elif isinstance(x["classes_to_id"], dict):
    all_types_i = list(x["classes_to_id"].keys())
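To illustrate the fix, here is a self-contained sketch with hypothetical batch data (the `types_for_sentence` helper and the sample batches are mine; only the `classes_to_id` shapes follow the thread):

```python
# Shape when entity_types is None: one mapping per sentence in the batch.
batch_without_types = {
    "classes_to_id": [
        {"person": 1, "location": 2},
        {"person": 1, "date": 2},
    ]
}

# Shape when entity_types is provided: a single mapping shared by the batch.
batch_with_types = {"classes_to_id": {"person": 1, "location": 2}}

def types_for_sentence(x, i):
    """Return the entity-type names for sentence i, handling both shapes."""
    if isinstance(x["classes_to_id"], list):
        return list(x["classes_to_id"][i].keys())
    elif isinstance(x["classes_to_id"], dict):
        return list(x["classes_to_id"].keys())

print(types_for_sentence(batch_without_types, 1))  # ['person', 'date']
print(types_for_sentence(batch_with_types, 0))     # ['person', 'location']
```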

tcourat avatar Apr 18 '24 13:04 tcourat

So, you want to fix the labels during training, for supervised fine-tuning?

The solution for this is to add the key `"label"` to each training sample (i.e., in addition to `"tokenized_text"` and `"ner"`). You can do it as follows:

# labels is your fixed list of entity-type strings
for i in range(len(train)):
    train[i]["label"] = labels
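Putting it together, a hedged end-to-end sketch: the `"tokenized_text"`, `"ner"`, and `"label"` keys follow the thread, while the sample sentences and entity-type list are hypothetical placeholders.

```python
# Fixed label set for supervised fine-tuning (hypothetical example types).
labels = ["person", "location", "organization"]

# Two toy training samples in the format the thread describes.
train = [
    {
        "tokenized_text": ["John", "lives", "in", "Paris"],
        "ner": [[0, 0, "person"], [3, 3, "location"]],
    },
    {
        "tokenized_text": ["ACME", "was", "founded", "in", "1990"],
        "ner": [[0, 0, "organization"]],
    },
]

# Attach the same fixed label list to every sample.
for i in range(len(train)):
    train[i]["label"] = labels

print(train[0]["label"])  # ['person', 'location', 'organization']
```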

@Ahmedn1 @tcourat

urchade avatar Apr 18 '24 21:04 urchade