During fine-tuning, `classes_to_id` is not handled correctly
https://github.com/urchade/GLiNER/blob/e15c22a01b1a018674f725428ba1325c723df307/gliner/model.py#L100
This errors out with a KeyError because it uses numeric indices to look up keys that can be strings.
This happens when `entity_types` is provided to `create_dataloader`.
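A minimal sketch of the failure mode (label names are illustrative): when `entity_types` is provided, `classes_to_id` is a single dict keyed by label strings, but the dataloader indexes it with a numeric batch position.

```python
# classes_to_id as built when entity_types is provided: one shared dict
# mapping label strings to ids (illustrative labels).
classes_to_id = {"person": 0, "location": 1}

try:
    classes_to_id[0]  # numeric lookup into a string-keyed dict
except KeyError as err:
    print(f"KeyError: {err}")  # → KeyError: 0
```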
I am not sure I understand. Is this for training or inference?
@urchade training. Following this notebook, with the only difference being that `entity_types` is provided to the training script, like so:

```python
train_loader = model.create_dataloader(train_data, batch_size=config.train_batch_size, shuffle=True, entity_types=config.entity_types)
```
Is `entity_types` in the correct format? It should be a list of strings.
Actually, I do not suggest setting entity types during training; omitting them gives better generalization.
@urchade yes, it is a list of strings.
I have the same issue when specifying `entity_types` in `create_dataloader`.
The issue is here:
When `entity_types` is set to `None`, `classes_to_id` is indeed a LIST of dictionaries (one for each sentence in the batch):
https://github.com/urchade/GLiNER/blob/e15c22a01b1a018674f725428ba1325c723df307/gliner/modules/base.py#L62
But when `entity_types` is provided, it is a single dictionary:
https://github.com/urchade/GLiNER/blob/e15c22a01b1a018674f725428ba1325c723df307/gliner/modules/base.py#L113
So a quick fix, when using `entity_types`, is to handle both cases by removing this line:
https://github.com/urchade/GLiNER/blob/e15c22a01b1a018674f725428ba1325c723df307/gliner/model.py#L100
And instead adding:

```python
if isinstance(x["classes_to_id"], list):
    all_types_i = list(x["classes_to_id"][i].keys())
elif isinstance(x["classes_to_id"], dict):
    all_types_i = list(x["classes_to_id"].keys())
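A runnable sketch of the same normalization, pulled out as a standalone function (the function name and sample labels are illustrative, not from the GLiNER codebase): it returns the label names for sentence `i` whether `classes_to_id` is a per-sentence list of dicts or one shared dict.

```python
def get_types(classes_to_id, i):
    # Per-sentence list of dicts (entity_types=None case)
    if isinstance(classes_to_id, list):
        return list(classes_to_id[i].keys())
    # Single shared dict (entity_types provided)
    elif isinstance(classes_to_id, dict):
        return list(classes_to_id.keys())

# Per-sentence dicts: each sentence has its own label set.
print(get_types([{"person": 0}, {"location": 0}], 1))   # → ['location']
# Shared dict: every sentence sees the same label set.
print(get_types({"person": 0, "location": 1}, 0))       # → ['person', 'location']
```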
So, you want to fix the label set during training, for supervised fine-tuning?
The solution for this is to add the key `"label"` to each training sample (i.e., in addition to `"tokenized_text"` and `"ner"`). You can do it as follows:

```python
for i in range(len(train)):
    train[i]["label"] = labels
```
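A self-contained sketch of that workaround; the sample sentence and label names below are illustrative, not from the original dataset.

```python
# Fixed label set to attach to every sample (illustrative labels).
labels = ["person", "location"]

# One toy training sample in GLiNER's expected shape.
train = [
    {"tokenized_text": ["Alice", "lives", "in", "Paris"],
     "ner": [[0, 0, "person"], [3, 3, "location"]]},
]

# Attach the full label set to each sample.
for i in range(len(train)):
    train[i]["label"] = labels

print(train[0]["label"])  # → ['person', 'location']
```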
@Ahmedn1 @tcourat