Cardinals up to a hundred trillions, timeFST and transliteration
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Before your PR is "Ready for review"
Pre checks:
- [x ] Have you signed your commits? Use
git commit -sto sign. - [x ] Do all unittests finish successfully before sending PR?
-
pytestor (if your machine does not have GPU)pytest --cpufrom the root folder (given you marked your test cases accordingly@pytest.mark.run_only_on('CPU')). - Sparrowhawk tests
bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
-
- [ x] If you are adding a new feature: Have you added test cases for both
pytestand Sparrowhawk here. - [x ] Have you added
__init__.pyfor every folder and subfolder, includingdatafolder which has .TSV files? - [x ] Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
- [ x] Have you added the correct license header
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.to all newly added Python files? - [ x] If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be
Copyright 2015 and onwards Google, Inc.. See an example here. - [ ] Remove import guards (
try import: ... except: ...) if not already done. - [ ] If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
- [ x] Have you added your language support to tools/text_processing_deployment/pynini_export.py.
PR Type:
- [ x] New Feature
- [ ] Bugfix
- [ ] Documentation
- [ ] Test
If you haven't finished some of the above items you can still open "Draft" PR.
@tbartley94 The pull request is ready for review.
Major requests are just to move more of the string maps into the
datafolder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.
Great, thank you for the feedback, I will add the changes and get back.
Major requests are just to move more of the string maps into the
datafolder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.Great, thank you for the feedback, I will add the changes and get back.
Let me know if you're running low on bandwidth. We can merge the simpler stuff and leave improvements for other PRs.
Major requests are just to move more of the string maps into the
datafolder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.Great, thank you for the feedback, I will add the changes and get back.
Let me know if you're running low on bandwidth. We can merge the simpler stuff and leave improvements for other PRs.
Yes, we should do this. I fixed most of the issues you pointed out, and I can work on the rest on a future PR.
@zoobereq can you assist in figuring out the jenkins issue?
JenkinsFile needs to be up-do-date with the main in order to utilize most-recent grammars in the CI pipeline. The CI checks won't pass otherwise. Either copy the content of JenkinsFile directly from the main and push the updated file to your PR, or rebase Digital-Umuganda:main over the upstream main branch and resolve any conflicts.
Please make sure to accept and implement the formatting changes pushed by pre-commit.ci.
In order to pass the CI pipeline (continuous-integration/jenkins), you will have to update your Jenkinsfile. You can either rebase this branch over the current main or copy-paste the contents of Jenkinsfile from main into the current PR. Rebasing is of course more comprehensive and less risky.
We use Jenkinsfile to define a pipeline for automating the building, testing, and deployment of text normalization (TN) and inverse text normalization (ITN) grammars for multiple languages, using Docker and various tools like PyTorch, NeMo, and pytest. One of the things that Jenkinsfile contains is environment variables for various language-specific text normalization (TN) cache directories (e.g., EN_TN_CACHE for English, DE_TN_CACHE for German, etc.). These directories store precompiled grammars for efficient re-use during the pipeline execution. Keeping them up-to-date is crucial for successful execution of the CI pipeline (and passing all the checks).
Please let me know if you need further assistance with this PR. We will be happy review and merge it as soon as all checks are passed.
Almost there. It's failing the formatting, which is easy to address. Either pull the changes generated by the pre-commit hooks or run setup.py style --fix from the root directory, whichever is easier.
Almost there. It's failing the formatting, which is easy to address. Either pull the changes generated by the
pre-commithooks or runsetup.py style --fixfrom the root directory, whichever is easier.
Done.
The pre-commit hooks are still failing, due to Black reformatting your code upon saving. Make sure that the formatting adheres to the NeMo guidelines by either pulling the push issued by pre-commit or running setup.py style --fix and disabling Black prior to saving and pushing (if Black is triggered automatically upon saving the code).
CodeQL won't run until the above passes.
@zoobereq I'm feeling if CL tests pass we can just approve and then catch formatting things on our end. Make things easier.
Mind giving a second set of eyes before we approve?