Jehovah's Witnesses Sign Language Resources

AmitMY opened this issue 2 years ago • 38 comments

We should add resources from JW, like the Bible.

AmitMY · Feb 15 '23

@ShesterG this is something you could perhaps take on ;)

bricksdont · Feb 15 '23

Chipping in here to provide a bit of assistance with Shester's dataset. Asked his permission to help!

It seems he got started with this at https://github.com/ShesterG/datasets/tree/shester/sign_language_datasets/datasets/jw_sign

cleong110 · Dec 13 '23

Shester informs me that the annotations can be recreated via https://github.com/ShesterG/datasets/blob/shester/sign_language_datasets/datasets/jw_sign/create_index.py; it just takes a long time to run.

They've been precomputed and saved off; they just need to be hosted somewhere.

cleong110 · Dec 13 '23

OK, for now the files are uploaded to https://drive.google.com/drive/folders/1QFmq5Byg0xTLgJ7sBdVuQBlgxrkzp9vV , thank you Shester! Now we need to...

  • [ ] Using DGS Corpus as inspiration, add data-loading functionality to https://github.com/ShesterG/datasets/tree/shester/sign_language_datasets/datasets/jw_sign. It should ideally work the same as DGS in the example notebook.
  • [x] DGS Corpus uses the tfds download manager; see this function in the main branch for an example. So we need to figure out how to download Google Drive files with tfds (see the sketch after this list).
  • [ ] Then, figure out how to test all this (perhaps adapt https://github.com/ShesterG/datasets/blob/shester/sign_language_datasets/datasets/dgs_corpus/dgs_corpus_test.py?).
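
To get ahead of the download-manager item, here's a rough sketch of what the builder could look like, modeled on dgs_corpus. The class name, features, and layout here are my guesses rather than settled decisions; only the tfds API calls and the Drive ID for newindex.list.gz are real.

import tensorflow_datasets as tfds

# Direct-download form of the Drive link for newindex.list.gz.
_INDEX_URL = "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf"


class JWSign(tfds.core.GeneratorBasedBuilder):
    """Hypothetical JW Sign builder, modeled on dgs_corpus."""

    VERSION = tfds.core.Version("1.0.0")

    def _info(self) -> tfds.core.DatasetInfo:
        return tfds.core.DatasetInfo(
            builder=self,
            description="Bible verses in many sign languages, from JW.",
            features=tfds.features.FeaturesDict({
                "verse_unique": tfds.features.Text(),
                "video_url": tfds.features.Text(),
            }),
        )

    def _split_generators(self, dl_manager: tfds.download.DownloadManager):
        # Let the tfds download manager fetch and cache the precomputed index.
        index_path = dl_manager.download(_INDEX_URL)
        return {"train": self._generate_examples(index_path)}

    def _generate_examples(self, index_path):
        # TODO: unpickle the index and yield (verse_unique, example) pairs.
        ...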

cleong110 · Jan 04 '24

One thing for us to consider: Google Drive will sometimes cause issues if too many people download the same files. See https://github.com/tensorflow/datasets/issues/1482

and

https://www.tensorflow.org/datasets/overview#manual_download_if_download_fails

cleong110 · Jan 04 '24

OK, the following actually manages to download newindex.list.gz, but saves it with a weird name. However, when I manually rename it, I can open the file with 7-Zip and see it's the right file.

import tensorflow_datasets as tfds

if __name__ == "__main__":
    ####################################
    # Try to download newindex.list.gz
    ####################################

    # The plain share link downloads a 0 MB empty file.
    google_drive_link_to_newindex = "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"

    # Extract the ID from the link above and append it to
    # "https://drive.google.com/uc?id=": this downloads an actual file.
    google_drive_link_to_newindex_take2 = "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf"

    dl_manager = tfds.download.DownloadManager(download_dir="./foo")
    downloaded_path = dl_manager.download(google_drive_link_to_newindex_take2)

    # Prints "foo\ucid_1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk":
    # the ID from above followed by "2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk",
    # which I don't recognize (presumably some tfds-generated hash suffix).
    print(downloaded_path)

cleong110 · Jan 04 '24

The implication here is that building URLs like "https://drive.google.com/uc?id=<ID goes here>" seems to work.
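
A small helper to that effect (hypothetical, just to capture the pattern):

import re


def drive_share_to_direct(url: str) -> str:
    """Convert a Drive "file/d/<ID>/view" share link to the "uc?id=<ID>" form."""
    match = re.search(r"/file/d/([^/]+)", url)
    if not match:
        raise ValueError(f"Not a recognized Google Drive share link: {url}")
    return "https://drive.google.com/uc?id=" + match.group(1)


# drive_share_to_direct("https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link")
# -> "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf"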

cleong110 · Jan 04 '24

OK, the next thing I want to figure out is how to actually download and load files.

DGS Corpus actually includes a "dgs.json" on GitHub, about 440 kB in size.

https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/dgs_corpus/dgs.json

When I open it up with Firefox, it looks like there are links to video files in there.

This is most similar, I think, to our "newindex.list.gz", in that there's a list of unique data items, with URL links to videos.

cleong110 · Jan 04 '24

Here are my notes on newindex.list.gz (drive link):

Keys: ['video_url', 'video_name', 'verse_lang', 'verse_name', 'verse_start', 'verse_end', 'duration', 'verse_unique', 'verseID']

First 10:

{'video_url': 'https://download-a.akamaihd.net/files/media_publication/f3/nwt_01_Ge_ALS_03_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_03_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 3:15', 'verse_start': '0.000000', 'verse_end': '31.198000', 'duration': 31.198, 'verse_unique': 'ALS Zan. 3:15', 'verseID': 'v1003015'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/59/nwt_01_Ge_ALS_39_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_39_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 39:2', 'verse_start': '0.000000', 'verse_end': '26.760000', 'duration': 26.76, 'verse_unique': 'ALS Zan. 39:2', 'verseID': 'v1039002'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/59/nwt_01_Ge_ALS_39_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_39_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 39:3', 'verse_start': '26.760000', 'verse_end': '47.848000', 'duration': 21.087999999999997, 'verse_unique': 'ALS Zan. 39:3', 'verseID': 'v1039003'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/10/nwt_03_Le_ALS_19_r720P.mp4', 'video_name': 'nwt_03_Le_ALS_19_r720P', 'verse_lang': 'ALS', 'verse_name': 'Lev. 19:18', 'verse_start': '0.000000', 'verse_end': '32.399000', 'duration': 32.399, 'verse_unique': 'ALS Lev. 19:18', 'verseID': 'v3019018'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/0c/nwt_03_Le_ALS_25_r720P.mp4', 'video_name': 'nwt_03_Le_ALS_25_r720P', 'verse_lang': 'ALS', 'verse_name': 'Lev. 25:10', 'verse_start': '0.000000', 'verse_end': '8.320000', 'duration': 8.32, 'verse_unique': 'ALS Lev. 25:10', 'verseID': 'v3025010'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/64/nwt_05_De_ALS_06_r720P.mp4', 'video_name': 'nwt_05_De_ALS_06_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 6:6', 'verse_start': '0.000000', 'verse_end': '7.341000', 'duration': 7.341, 'verse_unique': 'ALS Ligj. 6:6', 'verseID': 'v5006006'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/64/nwt_05_De_ALS_06_r720P.mp4', 'video_name': 'nwt_05_De_ALS_06_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 6:7', 'verse_start': '7.341000', 'verse_end': '24.024000', 'duration': 16.683, 'verse_unique': 'ALS Ligj. 6:7', 'verseID': 'v5006007'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/3d/nwt_05_De_ALS_10_r720P.mp4', 'video_name': 'nwt_05_De_ALS_10_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 10:20', 'verse_start': '0.000000', 'verse_end': '10.644000', 'duration': 10.644, 'verse_unique': 'ALS Ligj. 10:20', 'verseID': 'v5010020'}       
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/34/nwt_05_De_ALS_32_r720P.mp4', 'video_name': 'nwt_05_De_ALS_32_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 32:4', 'verse_start': '0.000000', 'verse_end': '43.844000', 'duration': 43.844, 'verse_unique': 'ALS Ligj. 32:4', 'verseID': 'v5032004'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/1e/nwt_09_1Sa_ALS_01_r720P.mp4', 'video_name': 'nwt_09_1Sa_ALS_01_r720P', 'verse_lang': 'ALS', 'verse_name': '1 Sam. 1:15', 'verse_start': '0.000000', 'verse_end': '23.557000', 'duration': 23.557, 'verse_unique': 'ALS 1 Sam. 1:15', 'verseID': 'v9001015'}

The file is a pickled list, compressed with gzip.

Compressed, it's about 19,000 KB, or roughly 19 MB; decompressed, it's closer to 100 MB.
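
Since it's just gzip plus pickle, loading it looks something like this (a sketch, assuming the file is in the working directory):

import gzip
import pickle

# The index is a gzip-compressed pickled list of dicts with the keys listed above.
with gzip.open("newindex.list.gz", "rb") as f:
    verses = pickle.load(f)

print(len(verses))
print(verses[0]["verse_unique"])  # e.g. 'ALS Zan. 3:15'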

cleong110 · Jan 04 '24

(Side note: investigate Parquet data format?)

cleong110 · Jan 16 '24

(or Arrow?)

cleong110 · Jan 16 '24

Another TODO at some point: Upload files to a better hosting platform, e.g. Zenodo or Zindi, to prevent issues with "too many downloads".

cleong110 · Jan 16 '24

What are the numbers in DGS? Unique IDs? Should we generate some for our dataset?

cleong110 · Jan 16 '24

JSON for DGS is parsed here: https://github.com/sign-language-processing/datasets/blob/e864f36ddc452587a80f7622630a0871cd406a0d/sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py#L278

cleong110 · Jan 16 '24

And the JSON is created here: https://github.com/sign-language-processing/datasets/blob/e864f36ddc452587a80f7622630a0871cd406a0d/sign_language_datasets/datasets/dgs_corpus/create_index.py#L17, which calls the numbers "tr_id"

cleong110 · Jan 16 '24

Ah... "transcript ID". And they're not generated in the Python code; they're parsed from the source page via regex.
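
Just to illustrate the idea (the HTML and pattern here are made up; the real ones are in dgs_corpus/create_index.py):

import re

# Toy example: scrape transcript IDs out of page source with a regex.
page = '<a href="/transcript?tr_id=1413451">Some transcript</a>'
tr_ids = re.findall(r"tr_id=(\d+)", page)
print(tr_ids)  # ['1413451']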

cleong110 · Jan 16 '24

https://github.com/tensorflow/datasets/blob/master/docs/add_dataset.md is a helpful guide for adding TFDS datasets.

Also, apparently tfds.testing is a thing, used for example in dgs_corpus_test.py: https://tensorflow.google.cn/datasets/api_docs/python/tfds/testing

cleong110 · Jan 18 '24

Actually, the helpful guide above is the source for this page: https://tensorflow.google.cn/datasets/add_dataset?hl=en

cleong110 · Jan 18 '24

Went and figured out how the index was created, and pushed an updated version of create_index.py: https://github.com/ShesterG/datasets/pull/1

Now that I know the data a bit better, gonna move on to filling out the Builder class for JWSign

cleong110 · Jan 24 '24

(Screenshot from the presented JWSign slides.) This is what we're going for.

cleong110 · Feb 15 '24

Not being familiar with tfds or sign_language_datasets, I am attempting to go with a "get basic functionality working and then test it" approach. But then I ran into the issue of not knowing how to test a dataset locally. #53 documents part of this, but the basic guide to testing is:

  1. Make sure you install from source.
  2. pip install pytest pytest-cov dill to get the testing deps.
  3. Run pytest . in whatever folder you want to run tests for, including the top level.

Of course the next question is how to make tests!

cleong110 · Feb 21 '24

OK, so even if you just follow https://tensorflow.google.cn/datasets/add_dataset?hl=en#test_your_dataset and add nothing, you'll still get some basic unit tests.
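
For reference, the boilerplate test from that guide looks roughly like this, adapted to a hypothetical jw_sign module (the names and split counts are placeholders):

import tensorflow_datasets as tfds

from . import jw_sign


class JWSignTest(tfds.testing.DatasetBuilderTestCase):
    """Runs the builder against the fake examples in dummy_data/."""

    DATASET_CLASS = jw_sign.JWSign
    SPLITS = {"train": 3}  # expected number of fake examples per split


if __name__ == "__main__":
    tfds.testing.test_main()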

cleong110 · Feb 21 '24

OK, testing procedure:

conda create -n sign_language_datasets_source pip python=3.10 # with 3.11 on Windows there's no compatible tensorflow
conda activate sign_language_datasets_source
# navigate to the repo
git pull # make sure it's up to date
python -m pip install . # "python -m pip" ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .

cleong110 · Mar 01 '24

All right, after messing around with #56, it seems that by deleting this file I am then able to run pytest.

I had also run into another weird issue in #57, where pytest hit an error while trying to tell me what a different error was.

Now I can finally proceed with the JW Sign dataset some more. Let's see if I can make a version which at least downloads the spoken-language text, and maybe make a much simplified index for testing purposes (sketched below).
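
For the simplified index, something like this should do it (the tiny-file name is made up):

import gzip
import pickle

with gzip.open("newindex.list.gz", "rb") as f:
    verses = pickle.load(f)

# Keep just a handful of entries to use as dummy data in tests.
with gzip.open("newindex.tiny.list.gz", "wb") as f:
    pickle.dump(verses[:5], f)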

cleong110 · Mar 04 '24

In order to iterate on and test the dataset, I will need to:

# in the top-level directory of the repo, with __init__.py removed
# make some change to the builder script
pytest ./sign_language_datasets/datasets/new_dataset/
# pip install . # not necessary actually, you can simply run the test

cleong110 · Mar 04 '24

OK, I did

# navigate to sign_language_datasets/datasets/
tfds new new_dataset # create a new directory

And then I repeatedly edited and re-ran pytest, using the rwth_phoenix2014_t code as a base, until the VideoTest passed. Excellent.

cleong110 · Mar 04 '24

OK, I'm starting from scratch. I made a new fork, this time forking off of https://github.com/sign-language-processing/datasets so that I'm up to date.

cleong110 · Mar 04 '24

https://github.com/cleong110/datasets/tree/jw_sign

cleong110 · Mar 04 '24

I want to see if I can make a completely basic text-only dataset to start.

cleong110 · Mar 04 '24

Apparently Google Drive doesn't play nice. When I try to use tfds' download_and_extract method on the text51.dict.gz file, I get an HTML file instead.

Turns out Google likes to pop up a "can't scan this for viruses" message, and that's what gets downloaded.

The gdown library works, but then that doesn't plug into tfds.
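
For the record, the gdown call that does work (shown with the newindex.list.gz ID from earlier; the same approach applies to text51.dict.gz):

import gdown

# gdown gets past the "can't scan this file for viruses" interstitial
# that makes a naive download save an HTML page instead of the file.
url = "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf"
gdown.download(url, "newindex.list.gz", quiet=False)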

Here's my Colab notebook playing with it: https://colab.research.google.com/drive/1EMKnpKrDUHxq5COFM6Acm7PqAQmkdvTS?usp=sharing

cleong110 · Mar 04 '24