Thai NLP Project backlog
If you are interested in supporting Thai Natural Language Processing (ThaiNLP), we have a backlog.
- Build open-source text-to-speech: You can add Thai to an open-source text-to-speech system using the TSynC-1 Corpus (free for open source; the dataset is CC BY-NC-SA 3.0), or you can build a new text-to-speech corpus. (I advise using a CC-BY or CC0 license.)
- Contribute to Common Voice: You can contribute to Common Voice to help build an open-source speech recognition engine, and you can add Thai sentences to the Common Voice Sentence Collector.
If you are working on any of these, do not forget to inform us on this issue.
Thank you.
A few ideas, mostly still in the planning phase:
- tokenization together with spellcheck (see the sketch after this list)
- autocorrect from such spellcheck
- misspelling dataset
- sentence (or EDU) segmentation dataset
- Thai word frequency dataset (and its misspellings)
- OCR normalization (correct tone marks and upper vowels)
- Orchid corpus cleanup and standardization on the UD tagset
- Tokenize the whole Wisesight dataset
- Clean up the new TNC and use it to train a POS tagger
- Correct the PUD treebank to follow the Chula UD guidelines.
- Add more sentences to the PUD treebank
- Use the PUD treebank to train a dependency parser for Stanza (previously StanfordNLP)
- Collect and organize a YouTube Thai subtitle dataset
- Train an auto-correct model from the subtitle dataset, and provide an automatic Thai subtitle creation service.
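As a starting point for the first two items, here is a minimal sketch of tokenization plus spellcheck, assuming PyThaiNLP's `word_tokenize` and `spell.correct` APIs; the wrapper name `tokenize_and_spellcheck` is just illustrative.

```python
# Minimal sketch: tokenize Thai text, then suggest a correction per token.
# Assumes PyThaiNLP is installed (pip install pythainlp); the wrapper name
# tokenize_and_spellcheck is made up for this example.
from pythainlp.tokenize import word_tokenize
from pythainlp.spell import correct

def tokenize_and_spellcheck(text: str):
    """Return (token, suggested_correction) pairs for a Thai string."""
    tokens = word_tokenize(text, engine="newmm", keep_whitespace=False)
    return [(tok, correct(tok)) for tok in tokens]

if __name__ == "__main__":
    # Any pair where the token differs from the suggestion is a candidate
    # misspelling, which also feeds the misspelling-dataset idea above.
    for token, suggestion in tokenize_and_spellcheck("ผมเปนคนไทย"):
        print(token, "->", suggestion)
```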
If any of these interest you, I can give advice on how to get started on them.
Hi, I'm interested in working on/supporting the Text-to-speech issue, please let me know if anyone else is :)
Greetings, I'm currently looking into an auto-correct model for automated Thai subtitle creation, but I lack the dataset. I have about 200-500 sentences and a somewhat working subtitle segmentation service (S2T transcript -> human-readable sub). I think I can help gather the subtitle dataset. I will gladly be of assistance 😄.
For the YouTube subtitle dataset, here are the current resources and work in progress.
- A script that runs every hour, searching for new YouTube videos that might have a Thai subtitle. See thai_sub.gs
- The results of that script are saved to a BigQuery table, `kora-id.youtube.thaivdo` (publicly accessible).
- Another script runs occasionally to filter `thaivdo` somewhat and saves the result to a cleaner `kora-id.youtube.thaivdo2`.
These are human-made Thai subtitles. We can compare them to the Google Speech-to-Text output and train an auto-correct model. Here are some more scripts to make it easier (a small sketch of putting them together follows the list).
- Download subtitles from a video_id: youtube_transcript.py, download_subtitle.py
- Transcribe a YouTube video using Speech-to-Text: youtube_to_json.py
- Use BigQuery with Python: python_client.py
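For illustration, here is a rough sketch that ties these pieces together: pull video ids from the shared BigQuery table, then download each video's human-made Thai subtitle track. The column name `video_id` and the use of the `google-cloud-bigquery` and `youtube-transcript-api` packages are assumptions; the repository scripts above are the authoritative versions.

```python
# Rough sketch only: fetch candidate videos from the public BigQuery table,
# then download each video's human-made Thai subtitle track.
# Assumptions: the table has a video_id column, GOOGLE_APPLICATION_CREDENTIALS
# is configured, and the classic get_transcript() interface of
# youtube-transcript-api is available.
from google.cloud import bigquery
from youtube_transcript_api import YouTubeTranscriptApi

def thai_video_ids(limit: int = 10):
    """Query the shared table for a few candidate YouTube video ids."""
    client = bigquery.Client()
    query = f"SELECT video_id FROM `kora-id.youtube.thaivdo2` LIMIT {limit}"
    return [row.video_id for row in client.query(query).result()]

def thai_subtitle(video_id: str):
    """Download the Thai subtitle track (list of text/start/duration dicts)."""
    return YouTubeTranscriptApi.get_transcript(video_id, languages=["th"])

if __name__ == "__main__":
    for vid in thai_video_ids():
        for line in thai_subtitle(vid):
            print(vid, line["start"], line["text"])
```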
To-Do
- Collect statistics from `thaivdo2` (and probably `thaivdo`), so we know which YouTube channels are good sources of Thai subtitles.
- Collect video_id from all older videos of those channels.
- Alignment algorithms between the STT result and the human-made subtitles (a rough sketch follows).
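As a concrete starting point for the alignment item, here is a small sketch using Python's standard-library `difflib.SequenceMatcher` on already-tokenized text. Real subtitle alignment would also need the time codes, and the token lists below are toy data.

```python
# Toy sketch: align STT tokens against human-subtitle tokens with difflib.
# Both inputs are assumed to be pre-tokenized (e.g. with PyThaiNLP's
# word_tokenize); "replace" spans are natural training pairs for auto-correct.
from difflib import SequenceMatcher

def align(stt_tokens, sub_tokens):
    """Yield (op, stt_span, subtitle_span) triples describing the alignment."""
    matcher = SequenceMatcher(a=stt_tokens, b=sub_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        yield tag, stt_tokens[i1:i2], sub_tokens[j1:j2]

if __name__ == "__main__":
    stt = ["สวัสดี", "ครับ", "ทุก", "ท่าน"]   # toy STT output
    sub = ["สวัสดี", "ค่ะ", "ทุก", "ท่าน"]    # toy human subtitle
    for tag, a, b in align(stt, sub):
        print(tag, a, b)
```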
Let me reply in Thai. I once built text-to-speech with Tacotron using the TSynC-1 Corpus, but even after training for 1 million epochs the model could not synthesize full sentences at all. Someone else has tried Tacotron 2 with a dataset generated from Google text-to-speech, and the result sounds OK: https://link.medium.com/KJeFCYpck6
As for me, I built another text-to-speech system at https://github.com/PyThaiNLP/tts-thai using Google's tools together with the TSynC-1 Corpus, and the output is OK. However, I did not implement G2P, so it cannot synthesize words that are not in the dictionary.
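To show what the missing piece might look like, here is a hedged sketch of a dictionary-plus-G2P fallback for out-of-dictionary words. It assumes PyThaiNLP's `transliterate` with the `thaig2p` engine (which needs optional deep-learning dependencies); the `LEXICON` contents, the phone notation, and the `to_phones` name are all illustrative and not what tts-thai actually uses.

```python
# Hedged sketch of an out-of-dictionary fallback for a Thai TTS front end.
# Assumptions: PyThaiNLP's thaig2p transliteration engine is installed, and
# LEXICON / the phone strings are placeholders, not the tts-thai format.
from pythainlp.transliterate import transliterate

LEXICON = {"สวัสดี": "s a . w a t . d ii"}  # placeholder entry

def to_phones(word: str) -> str:
    """Return phones from the lexicon, falling back to G2P for unseen words."""
    if word in LEXICON:
        return LEXICON[word]
    return transliterate(word, engine="thaig2p")

if __name__ == "__main__":
    print(to_phones("สวัสดี"))   # found in the dictionary
    print(to_phones("ไพทอน"))    # out-of-dictionary, handled by G2P
```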