bricks Phonetic transcription

refinery

[ ] Tested by creator on refinery
[ ] Tested by reviewer on refinery
[ ] Ensured that output of brick conforms with refinery structure (to be checked by reviewer)

API

[x] Tested by creator on localhost:8000/docs
[ ] Tested by reviewer on localhost:8000/docs

common code

[x] Common code tested in notebook/ script by creator
[ ] Common code tested in notebook/ script by reviewer
[x] Common code contains docstrings and type hints

additional points:

[x] Docstring and README exist
[x] Import statements (in __init__.py)
[x] (If necessary) Added dependency to requirements.txt
[ ] (If necessary) Added dependency to issue for refinery env here
[ ] Published brick to Strapi CMS (locally)

Testing procedure: When testing in refinery, please ensure that the output of the brick conforms with the structure of refinery. For extraction bricks, this would be a tuple like ("label", span_start, span_end). For classification bricks, this would be a string representing a label. For generator bricks, this would be a float, integer, string, boolean or list, depending on the situation.
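The output shapes described above can be checked with a few small helpers. This is a minimal sketch, not part of bricks or refinery: the function names are hypothetical, and the checks only encode what the testing procedure states.

```python
from typing import Any


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def is_extraction_output(value: Any) -> bool:
    """True for a ("label", span_start, span_end) tuple."""
    return (
        isinstance(value, tuple)
        and len(value) == 3
        and isinstance(value[0], str)
        and _is_int(value[1])
        and _is_int(value[2])
    )


def is_classification_output(value: Any) -> bool:
    """True for a string representing a label."""
    return isinstance(value, str)


def is_generator_output(value: Any) -> bool:
    """True for a float, integer, string, boolean or list."""
    return isinstance(value, (float, int, str, bool, list))
```

Running such checks over a handful of varied texts during review makes a mismatch with the refinery structure visible early.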

When testing the bricks, avoid relying on a single source of data: use not only the clickbait sample project, but also different texts with longer or more complex strings.

A small refinery example project with a variation of texts called bricks-test-data-project.zip can be found in the bricks repository.

Feb 08 '24 20:02 springlaughing

This one implements issue #278.

Hello, I'm trying to make another brick, this time a phonetic transcriptor. There are some things to note about this one:

In general, Linux or WSL is required (at least for English, because of Flite)
A CEDICT .txt file is required for Chinese

Here are the steps to set up the environment to run the package:

Install epitran:
pip install epitran

Install jieba:
pip install jieba

Get Flite for English:
git clone http://github.com/festvox/flite
cd flite
./configure
make
sudo make install
cd testsuite
make lex_lookup
sudo cp lex_lookup /usr/local/bin

Get CEDICT for Chinese: download and unpack it from https://www.mdbg.net/chinese/dictionary?page=cedict, then pass the path of the unpacked file as cedict_path to the phonetic_transcriptor function.
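Whether the setup steps above succeeded can be verified before the brick is called. The following is a rough sketch using only the standard library; the helper name missing_prerequisites and the language labels "english" and "chinese" are assumptions for illustration and are not taken from the brick's code.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def missing_prerequisites(language: str, cedict_path: Optional[str] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means nothing is missing."""
    problems: List[str] = []
    # English needs Flite's lex_lookup binary to be reachable on PATH.
    if language == "english" and shutil.which("lex_lookup") is None:
        problems.append("lex_lookup (Flite) not found on PATH")
    # Chinese needs the unpacked CEDICT .txt file.
    if language == "chinese":
        if cedict_path is None or not Path(cedict_path).is_file():
            problems.append("CEDICT file not found; pass its path as cedict_path")
    return problems


if __name__ == "__main__":
    for lang in ("english", "chinese"):
        print(lang, missing_prerequisites(lang))
```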

Feb 08 '24 20:02 springlaughing

Hi @springlaughing, thank you for the contribution! Code looks good so far, will test more thoroughly, though. As this brick will require some dependencies to be installed, we will most likely wait until the next release to merge this, as our dev team can then also add the requirements to our tool refinery for the bricks integration. Do you know if flite is definitely needed, or if only epitran or jieba are needed for this? :)

Feb 14 '24 14:02 LeonardPuettmannKern

Yes, Flite is needed for epitran to produce phonetic transcriptions for English, as documented on the epitran GitHub page: https://github.com/dmort27/epitran. The same goes for Chinese: CEDICT is needed for epitran to produce phonetic transcriptions for Chinese, as also mentioned on the epitran page. Additionally, I have used jieba as the tokenizer for Chinese, but that shouldn't be a problem, as it is a simple dependency to install and is MIT-licensed. :)
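For reference, the way these dependencies fit together might look roughly like the sketch below. It is not the code from this PR: the function name transcribe, the language labels, and the choice to join jieba tokens with spaces are assumptions. The epitran calls (Epitran with a language code, the cedict_file argument, transliterate) and jieba.cut follow the respective READMEs.

```python
from typing import Optional

# Epitran language codes (ISO 639-3 plus script), as listed in the epitran README.
EPITRAN_CODES = {"english": "eng-Latn", "chinese": "cmn-Hans"}


def transcribe(text: str, language: str, cedict_path: Optional[str] = None) -> str:
    """Return a phonetic transcription of text (hypothetical helper)."""
    if language not in EPITRAN_CODES:
        raise ValueError(f"unsupported language: {language}")
    if language == "chinese" and cedict_path is None:
        raise ValueError("cedict_path is required for Chinese")

    # Imported lazily so the argument checks above work without the packages installed.
    import epitran

    code = EPITRAN_CODES[language]
    if language == "chinese":
        import jieba

        epi = epitran.Epitran(code, cedict_file=cedict_path)
        # Assumption: tokenize with jieba first, then transliterate token by token.
        tokens = [token for token in jieba.cut(text) if token.strip()]
        return " ".join(epi.transliterate(token) for token in tokens)

    # English: epitran relies on Flite's lex_lookup being installed.
    epi = epitran.Epitran(code)
    return epi.transliterate(text)
```

Validating the arguments before importing epitran keeps the error messages clear when a dependency such as CEDICT is simply missing.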

Feb 15 '24 01:02 springlaughing