NeMo-text-processing icon indicating copy to clipboard operation
NeMo-text-processing copied to clipboard

Cardinals up to a hundred trillions, timeFST and transliteration

Open kurt0cougar opened this issue 1 year ago • 1 comments

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Before your PR is "Ready for review"

Pre checks:

  • [x ] Have you signed your commits? Use git commit -s to sign.
  • [x ] Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • [ x] If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • [x ] Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • [x ] Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • [ x] Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • [ x] If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • [ ] Remove import guards (try import: ... except: ...) if not already done.
  • [ ] If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • [ x] Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • [ x] New Feature
  • [ ] Bugfix
  • [ ] Documentation
  • [ ] Test

If you haven't finished some of the above items you can still open "Draft" PR.

kurt0cougar avatar Aug 19 '24 05:08 kurt0cougar

@tbartley94 The pull request is ready for review.

kurt0cougar avatar Aug 19 '24 05:08 kurt0cougar

Major requests are just to move more of the string maps into the data folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.

Great, thank you for the feedback, I will add the changes and get back.

kurt0cougar avatar Aug 27 '24 23:08 kurt0cougar

Major requests are just to move more of the string maps into the data folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.

Great, thank you for the feedback, I will add the changes and get back.

Let me know if you're running low on bandwidth. We can merge the simpler stuff and leave improvements for other PRs.

tbartley94 avatar Aug 30 '24 17:08 tbartley94

Major requests are just to move more of the string maps into the data folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.

Great, thank you for the feedback, I will add the changes and get back.

Let me know if you're running low on bandwidth. We can merge the simpler stuff and leave improvements for other PRs.

Yes, we should do this. I fixed most of the issues you pointed out, and I can work on the rest on a future PR.

kurt0cougar avatar Sep 02 '24 04:09 kurt0cougar

@zoobereq can you assist in figuring out the jenkins issue?

tbartley94 avatar Sep 03 '24 17:09 tbartley94

JenkinsFile needs to be up-do-date with the main in order to utilize most-recent grammars in the CI pipeline. The CI checks won't pass otherwise. Either copy the content of JenkinsFile directly from the main and push the updated file to your PR, or rebase Digital-Umuganda:main over the upstream main branch and resolve any conflicts.

zoobereq avatar Sep 03 '24 18:09 zoobereq

Please make sure to accept and implement the formatting changes pushed by pre-commit.ci.

In order to pass the CI pipeline (continuous-integration/jenkins), you will have to update your Jenkinsfile. You can either rebase this branch over the current main or copy-paste the contents of Jenkinsfile from main into the current PR. Rebasing is of course more comprehensive and less risky.

We use Jenkinsfile to define a pipeline for automating the building, testing, and deployment of text normalization (TN) and inverse text normalization (ITN) grammars for multiple languages, using Docker and various tools like PyTorch, NeMo, and pytest. One of the things that Jenkinsfile contains is environment variables for various language-specific text normalization (TN) cache directories (e.g., EN_TN_CACHE for English, DE_TN_CACHE for German, etc.). These directories store precompiled grammars for efficient re-use during the pipeline execution. Keeping them up-to-date is crucial for successful execution of the CI pipeline (and passing all the checks).

Please let me know if you need further assistance with this PR. We will be happy review and merge it as soon as all checks are passed.

zoobereq avatar Sep 10 '24 18:09 zoobereq

Almost there. It's failing the formatting, which is easy to address. Either pull the changes generated by the pre-commit hooks or run setup.py style --fix from the root directory, whichever is easier.

zoobereq avatar Sep 12 '24 04:09 zoobereq

Almost there. It's failing the formatting, which is easy to address. Either pull the changes generated by the pre-commit hooks or run setup.py style --fix from the root directory, whichever is easier.

Done.

kurt0cougar avatar Sep 12 '24 23:09 kurt0cougar

The pre-commit hooks are still failing, due to Black reformatting your code upon saving. Make sure that the formatting adheres to the NeMo guidelines by either pulling the push issued by pre-commit or running setup.py style --fix and disabling Black prior to saving and pushing (if Black is triggered automatically upon saving the code).

CodeQL won't run until the above passes.

zoobereq avatar Sep 16 '24 15:09 zoobereq

@zoobereq I'm feeling if CL tests pass we can just approve and then catch formatting things on our end. Make things easier.

Mind giving a second set of eyes before we approve?

tbartley94 avatar Sep 17 '24 00:09 tbartley94