Proposal: Dataset based on subtitles for Japanese movies/TV shows from opensubtitles.org

Open Nan-Do opened this issue 2 years ago • 0 comments

I have just finished a dataset very similar to https://github.com/LAION-AI/Open-Assistant/tree/main/data/datasets/fd_dialogue but for Japanese, with the data taken from opensubtitles.org. The dataset contains subtitles for over 7000 TV shows and movies. Unlike the TV dialogue dataset, it is not formatted with the speaker mentioned first, because speaker labels are practically non-existent for Japanese, at least in the OpenSubtitles data. I was wondering whether I should go ahead and open a pull request following the steps in the README. Is this dataset good enough? The dataset is already uploaded to Hugging Face: https://huggingface.co/datasets/Nan-Do/OpenSubtitlesJapanese
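To make the "no speaker labels" point concrete, here is a rough sketch of how speaker-less dialogue lines could be pulled out of a subtitle file. It assumes the raw subtitles are in SubRip (SRT) format, which is an assumption and not something stated above. The names `srt_to_lines` and `SAMPLE` are hypothetical, and this is not necessarily how the uploaded dataset was built.

```python
import re
from typing import List

# Matches an SRT timing row such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def srt_to_lines(srt_text: str) -> List[str]:
    """Return one string per subtitle cue, dropping cue indices and timings.

    Hypothetical helper: assumes SRT input, with cues separated by blank lines.
    """
    lines = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        rows = [row.strip() for row in block.splitlines() if row.strip()]
        for i, row in enumerate(rows):
            if TIMING.match(row):
                # Everything after the timing row is the spoken text.
                # Rows are joined without a space, since Japanese does not
                # separate words with spaces.
                text = "".join(rows[i + 1:])
                if text:
                    lines.append(text)
                break
    return lines


# Made-up sample data, for illustration only.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
こんにちは

2
00:00:04,000 --> 00:00:06,000
お元気
ですか
"""

if __name__ == "__main__":
    for line in srt_to_lines(SAMPLE):
        print(line)
```

The result is a plain sequence of utterances with no indication of who is speaking, which is the main difference from the speaker-first layout of the TV dialogue dataset.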

Apr 19 '23 11:04 Nan-Do