Updates MFA scripts and pipeline
I have updated the following to make MFA work and clean up MFA scripts.
-
phoneme_tokenizer.py: added MFA G2P for all MFA languages. Added Espeak G2P including word separator, as training MFA with custom dictionary requires word splitting. -
mfa_cleaners.py: text cleaner added for MFA English -
cleaner.pyincludes MFA English and edited to put all cleaning functions in TextCleaner.init() instead of call() -
mfa_format.py: cleaned up help and comments. Allows both json and TextGrid formats (Praat only reads .TextGrid). make_dictionary allows multiple pronunciations per word. -
install_mfa.shis updated to the latest working version of MFA. Disabled pip installation as it is not recommended. -
mfa.sh: cleaned up help and comments. Updated for new MFA syntax as some args were wrong. Added options for textgrid format and single speaker. -
local/run_mfa.shandrun_mfa.share updated to reflect above changes -
copy_data_dir.shnow has durations file
The simplified process to train TTS with MFA is now:
# cd to egs2/*/tts1/
./local/run_mfa.sh [MFA_OPTIONS]
./run_mfa.sh [OPTIONS]
Thanks for your PR, @iamanigeeit! Can you fix the CI error? https://github.com/espnet/espnet/actions/runs/7986631083/job/21807429857?pr=5670
@Fhrozen, can you review this PR?
@sw005320 Sure, I will be checking it
@sw005320 @Fhrozen
For test python, I can modify the files to fit the 80 char limit per line. I was using 120 chars as the limit because it looks like most ESPnet files break 80 chars anyway (should we increase it to 120?).
For check_kaldi_symlinks, does it mean that copy_data_dir.sh must be identical between egs2/TEMPLATE/asr1/utils and tools/kaldi/egs/wsj/s5 ? Should I send a PR for kaldi to make them match?
For
check_kaldi_symlinks, does it mean thatcopy_data_dir.shmust be identical betweenegs2/TEMPLATE/asr1/utilsandtools/kaldi/egs/wsj/s5? Should I send a PR for kaldi to make them match?
Yes, it should be identical, but if you want to customize it for your own purpose, maybe you can put it in other places or rename it.
For
check_kaldi_symlinks, does it mean thatcopy_data_dir.shmust be identical betweenegs2/TEMPLATE/asr1/utilsandtools/kaldi/egs/wsj/s5? Should I send a PR for kaldi to make them match?Yes, it should be identical, but if you want to customize it for your own purpose, maybe you can put it in other places or rename it.
Hmm, in that case I propose creating a new file copy_tts_data_dir.sh and modifying tts.sh to use it. Is that better?
Codecov Report
Attention: Patch coverage is 35.93750% with 82 lines in your changes missing coverage. Please review.
Project coverage is 35.81%. Comparing base (
fa822d5) to head (d5d9cf5). Report is 390 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| espnet2/text/phoneme_tokenizer.py | 13.33% | 78 Missing :warning: |
| espnet2/text/cleaner.py | 71.42% | 4 Missing :warning: |
:exclamation: There is a different number of reports uploaded between BASE (fa822d5) and HEAD (d5d9cf5). Click for more details.
HEAD has 22 uploads less than BASE
Flag BASE (fa822d5) HEAD (d5d9cf5) test_python_espnetez 8 6 test_utils 10 3 test_integration_espnetez 8 0 test_python_espnet2 3 0 test_configuration_espnet2 2 0
Additional details and impacted files
@@ Coverage Diff @@
## master #5670 +/- ##
===========================================
- Coverage 72.21% 35.81% -36.40%
===========================================
Files 760 759 -1
Lines 69840 69925 +85
===========================================
- Hits 50435 25046 -25389
- Misses 19405 44879 +25474
| Flag | Coverage Δ | |
|---|---|---|
| test_configuration_espnet2 | ? |
|
| test_integration_espnet1 | 62.92% <ø> (ø) |
|
| test_integration_espnetez | ? |
|
| test_python_espnet1 | 18.27% <26.56%> (+0.06%) |
:arrow_up: |
| test_python_espnet2 | ? |
|
| test_python_espnetez | 13.96% <16.40%> (+<0.01%) |
:arrow_up: |
| test_utils | 20.91% <ø> (ø) |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
It looks like the entire directory egs2/TEMPLATE/asr1/utils must be identical to tools/kaldi/egs/wsj/s5. In that case, i don't know where to add copy_tts_data_dir.sh. Can i just make the PR for Kaldi? It is a very simple change to add durations.
As for the other failing checks, i think it is not related to this PR.
It looks like the entire directory
egs2/TEMPLATE/asr1/utilsmust be identical totools/kaldi/egs/wsj/s5. In that case, i don't know where to addcopy_tts_data_dir.sh. Can i just make the PR for Kaldi? It is a very simple change to adddurations.As for the other failing checks, i think it is not related to this PR.
You can just move it from egs2/TEMPLATE/asr1/utils/copy_tts_data_dir.sh to egs2/TEMPLATE/asr1/scripts/utils/copy_tts_data_dir.sh
@iamanigeeit , If possible, add some test for the espnet2/text/mfa_cleaners.py and the new lines you are adding at espnet2/text/cleaner.py
@Fhrozen I have added test_cleaners.py. However, i don't know enough Japanese, Korean or Vietnamese to write tests for those.
Is there a way to pause or stop the CI checks before everything is finalized? I feel like i am wasting ESPnet project funds on minor changes...
You can use [skip ci] at the beginning of the commit message to avoid ci test.
Also, you can run the test script on your local environment using ./ci/test_*.sh from the root of your repo.
If it is necessary, let me know to share a docker image file for testing it.
@Fhrozen I added [skip ci] at https://github.com/espnet/espnet/pull/5680 but it's still running the checks, is there anything else i have to do?
Hi @iamanigeeit, I do not follow all conversations with @Fhrozen, but we have a CI issue in https://github.com/espnet/espnet/actions/runs/8048817075/job/21980956442?pr=5670#step:8:37
Can you fix it?
Hi @iamanigeeit, I do not follow all conversations with @Fhrozen, but we have a CI issue in https://github.com/espnet/espnet/actions/runs/8048817075/job/21980956442?pr=5670#step:8:37
Can you fix it?
Hi @sw005320 --
The CI scripts on my end run fine -- the error is that whisper isn't installed on the CI environment, so whisper text cleaners are not supported. Which should i do?
- Add whisper to the CI yml files
- Comment out the whisper cleaner tests
- Detect installed packages and skip test if missing
I think whisper is installed in our basic CI https://github.com/espnet/espnet/blob/master/ci/install.sh#L24, but probably it is too old?
I think whisper is installed in our basic CI https://github.com/espnet/espnet/blob/master/ci/install.sh#L24, but probably it is too old?
I see the problem now.
tools/venv/lib/python3.7/site-packages/setuptools/_distutils/msvccompiler.py:70
...
tools/chainer/chainer/_environment_check.py:7
...
The test fails for Python 3.7 because whisper won't be installed. https://github.com/espnet/espnet/blob/3ad92c2adcdb471e88bb93c0f35b12a36d6a1b51/tools/installers/install_whisper.sh#L49-L53
Should we still support Python 3.7? If so, i can amend the test script.
Should we still support Python 3.7? If so, i can amend the test script.
Yeah, but it does not go through the CI test anyway due to pytorch<1.7.0
So, as a conclusion, we can skip the test...
We can revisit it once we safely update whisper
Conda installation problems were caused by installing ffmpeg from conda-forge (see installers/install_ffmpeg_conda.sh for details). I have updated the Makefile and ffmpeg installation to solve it. The root cause is https://github.com/conda-forge/ocl-icd-feedstock/issues/29 and https://github.com/conda-forge/libarchive-feedstock/issues/69
Hi @sw005320 @Fhrozen -- would it be ok to merge now? The only CI checks that fail are
test_utils/test_compute-cmvn-stats_py.bats
test_utils/test_copy-feats_py.bats
Both are not related to this PR.
This pull request is now in conflict :(