
Include entity positions as feature in ReCoRD

Open richarddwang opened this issue 3 years ago • 1 comments

https://huggingface.co/datasets/super_glue/viewer/record/validation

TLDR: We need to record entity positions, which are included in the source data but excluded by the loading script, to enable efficient and effective training for ReCoRD.

Currently, the loading script ignores the entity positions ("entity_start", "entity_end") and records only the entity text. This may be because the official baseline trains by making n training instances from each data point, replacing "@placeholder" in the query with each entity in turn.
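To make the cost of the baseline approach concrete, here is a minimal sketch of that expansion. The function name, the toy example, and the assumption that the placeholder is the literal string `@placeholder` are illustrative only; this is not the official baseline code.

```python
def expand_record_example(passage, query, entities, answers):
    """Build one training instance per candidate entity (hypothetical helper)."""
    instances = []
    for entity in entities:
        instances.append({
            "passage": passage,
            # Assumption: the query marks the blank with the literal "@placeholder".
            "query": query.replace("@placeholder", entity),
            # 1 if this candidate is a gold answer, else 0.
            "label": int(entity in answers),
        })
    return instances


# Toy data point, invented for illustration.
passage = "Alice met Bob Smith in Paris"
query = "@placeholder lives in Paris"
entities = ["Alice", "Bob Smith", "Paris"]
answers = ["Bob Smith"]

instances = expand_record_example(passage, query, entities, answers)
for inst in instances:
    print(inst["label"], inst["query"])
```

A data point with n candidate entities becomes n passage/query pairs, so the passage is encoded n times.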

But this multiplies the already heavy computation several times over. DeBERTa instead takes entity embeddings by their positions in the passage, so it makes one training instance from one data point. This is far more efficient and has proved effective for the ReCoRD task.
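A rough sketch of why the positions matter for this approach: with character spans available, the passage can be encoded once and each entity represented by pooling the token vectors that fall inside its span. Everything below is an assumption made for illustration (whitespace tokenization, mean pooling, exclusive end offsets, toy 2-d "embeddings"); it is not DeBERTa's actual implementation, and the offset convention in the source data should be checked before reuse.

```python
def whitespace_tokens_with_offsets(text):
    """Split on whitespace, keeping each token's (start, end) character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end
    return tokens


def char_span_to_token_indices(tokens, span_start, span_end):
    """Indices of tokens overlapping the character span [span_start, span_end)."""
    return [i for i, (_, s, e) in enumerate(tokens) if s < span_end and e > span_start]


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


passage = "Alice met Bob Smith in Paris"
# Hypothetical (entity_start, entity_end) pairs; end is assumed exclusive here.
entity_spans = [(0, 5), (10, 19), (23, 28)]

tokens = whitespace_tokens_with_offsets(passage)
# Stand-in for encoder output: one toy 2-d vector per token.
token_embeddings = [[float(i), 1.0] for i in range(len(tokens))]

# One pass over the passage yields one embedding per entity.
entity_embeddings = []
for start, end in entity_spans:
    idx = char_span_to_token_indices(tokens, start, end)
    entity_embeddings.append(mean_pool([token_embeddings[i] for i in idx]))

print(entity_embeddings)
```

With only the entity text and no positions, the spans would have to be recovered by string matching, which is ambiguous when an entity string occurs more than once in the passage.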

Can anybody help me with the dataset card rendering error? Maybe @lhoestq ?

richarddwang, Jun 12 '22 11:06


Thanks for the reply @lhoestq !

I have successfully run datasets-cli test ./datasets/super_glue --name record --save_infos, but as you can see, the check ran into FAILED tests/test_dataset_cards.py::test_changed_dataset_card[super_glue] - V.... How can we solve it?

richarddwang, Aug 17 '22 23:08

That would be neat! Let me implement it.

richarddwang, Aug 19 '22 01:08