
Improve HF integration

NielsRogge opened this issue · 6 comments

Hi @MikeDean2367,

Niels here from the open-source team at Hugging Face. I discovered your work through the paper page: https://huggingface.co/papers/2409.05152. I work together with AK on improving the visibility of researchers' work on the hub.

It's great to see the models available on the 🤗 hub! It would be great to add model cards, along with tags, so that people can find them when filtering https://huggingface.co/models. Tags like "text-generation" will make your work easier to discover. See more here: https://huggingface.co/docs/huggingface_hub/en/guides/model-cards.

The models can be linked to the paper page by adding https://huggingface.co/papers/2409.05152 in the model card.
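As a rough sketch, pushing a minimal model card with the huggingface_hub library could look like this (the repo id and card content below are just placeholders):

import textwrap
from huggingface_hub import ModelCard

# minimal card: YAML metadata (task tag + custom tags) followed by a short description
content = textwrap.dedent("""\
---
pipeline_tag: text-generation
tags:
- onegen
---

# OneGen

Paper: https://huggingface.co/papers/2409.05152
""")

card = ModelCard(content)
card.push_to_hub("your-hf-org/your-model")  # placeholder repo id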

Uploading dataset

It would be awesome to make the training dataset available on the 🤗 hub, rather than Google Drive, so that people can do:

from datasets import load_dataset

dataset = load_dataset("your-hf-org/your-dataset")

Besides that, there's the dataset viewer, which allows people to quickly explore the first few rows of the data in the browser. See here for a guide: https://huggingface.co/docs/datasets/loading.

Let me know if you're interested/need any help regarding this!

Cheers,

Niels, ML Engineer @ HF 🤗

NielsRogge · Sep 20 '24

Hi Niels,

Thank you for reaching out and for the support! I'm glad to hear that you discovered my work through the Hugging Face paper page.

I really appreciate the suggestion regarding adding model cards and relevant tags for visibility. I'll work on updating the model cards and will include the link to the paper as you recommended. Adding tags like "text-generation" definitely sounds helpful for making the work easier to find.

As for the dataset, I'll look into uploading it to the Hugging Face hub instead of relying on Google Drive. The dataset viewer guide you shared will be very useful.

I'll reach out if I need any further assistance. Thanks again for the guidance!

Best regards, Jintian Zhang

MikeDean2367 · Sep 23 '24

Hi Niels,

I'm sorry for the delayed response; other commitments have kept me from getting back to you sooner.

Our code supports loading the training dataset directly from the Hugging Face Hub. However, due to some errors, we are currently unable to use the load_dataset function. The details of the errors we encountered can be found here. Therefore, we are using hf_hub_download(repo_id=_hf_path['repo'], filename=_hf_path['name'], repo_type="dataset") to load the training dataset instead.
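Concretely, the workaround looks roughly like this (a simplified sketch; the SelfRAG repo and file name are used here just as an example):

import json
from huggingface_hub import hf_hub_download

# download the raw file from the dataset repo, then parse the JSON lines manually
path = hf_hub_download(
    repo_id="zjunlp/OneGen-TrainDataset-SelfRAG",
    filename="train.jsonl",
    repo_type="dataset",
)
with open(path, "r", encoding="utf-8") as f:
    train_data = [json.loads(line) for line in f]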

Thank you for your understanding and for your input regarding this issue.

Best regards, Jintian Zhang

MikeDean2367 · Oct 14 '24

Hi,

Thanks for pushing the commit and the explanation!

The reason datasets like https://huggingface.co/datasets/zjunlp/OneGen-TrainDataset-SelfRAG can't be loaded with the load_dataset functionality is that the data seems to have been uploaded just as raw files, rather than with the Datasets library.

One could make the files compatible with Datasets by loading them from JSON and then calling push_to_hub, which would enable:

from datasets import load_dataset

dataset = load_dataset("zjunlp/OneGen-TrainDataset-SelfRAG")
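In concrete terms, that conversion would be something along these lines (the target repo id is a placeholder):

from datasets import load_dataset

# load the raw JSON lines file as a Dataset, then push it to the hub in the Datasets-native format
dataset = load_dataset("json", data_files="train.jsonl")
dataset.push_to_hub("your-hf-org/OneGen-TrainDataset-SelfRAG")  # placeholder repo id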

NielsRogge · Oct 14 '24

Hi, we have tried the following code:

from datasets import load_dataset
dataset = load_dataset("json", data_files="./self_rag/train.jsonl")

But the error is the same:

Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1989, in _prepare_split_single
    writer.write_table(table)
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/arrow_writer.py", line 583, in write_table
    pa_table = pa_table.combine_chunks()
  File "pyarrow/table.pxi", line 3638, in pyarrow.lib.Table.combine_chunks
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowIndexError: array slice would exceed array length

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mikedean/upload/test.py", line 5, in <module>
    dataset = load_dataset("json",data_files="./self_rag/train.jsonl")
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

MikeDean2367 · Oct 14 '24

Ok this may be because it's JSON lines instead of JSON. One solution here could be to do the following:

import pandas as pd
from huggingface_hub import hf_hub_download
from datasets import Dataset

# read JSON lines
filepath = hf_hub_download(repo_id="zjunlp/OneGen-TrainDataset-SelfRAG", filename="train.jsonl", repo_type="dataset")
df = pd.read_json(filepath, lines=True)

# convert to HF dataset
dataset = Dataset.from_pandas(df)

# push to hub
dataset.push_to_hub("your-hf-username/selfrag")
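After that push, anyone could load it directly (same placeholder repo name as in the snippet above):

from datasets import load_dataset

dataset = load_dataset("your-hf-username/selfrag")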

NielsRogge · Oct 14 '24

Thank you! This is a great solution! However, I have another question: why does the train.jsonl file in the zjunlp/OneGen-TrainDataset-MultiHopQA repository not produce any errors?

MikeDean2367 · Oct 14 '24

Hi @MikeDean2367,

I think that file was also just uploaded via the web interface, so it does not seem to be compatible with the Datasets library either.

I see your paper does not have any linked datasets yet; have you considered uploading the dataset?

NielsRogge · Oct 18 '24

Hi @NielsRogge,

Thank you for your feedback! I will update the paper soon and add the link to the dataset. I appreciate your suggestion!

MikeDean2367 · Oct 18 '24