Thai NLP Project backlog
If you are interested in supporting Thai Natural Language Processing (ThaiNLP), we have a backlog.
- Build open-source text-to-speech: You can add Thai to an open-source text-to-speech system using the TSynC-1 Corpus (free for open source; the dataset is CC BY-NC-SA 3.0), or you can build a new text-to-speech corpus. (I advise using a CC-BY or CC0 license.)
- Contribute to Common Voice: You can contribute to Common Voice to help build an open-source speech recognition engine, and you can add Thai sentences to the Common Voice Sentence Collector.
If you are working on any of these, do not forget to inform us on this issue.
Thank you.
A few ideas, mostly still in the planning phase:
- tokenization together with spellcheck (see the sketch after this list)
- autocorrect from such spellcheck
- misspelling dataset
- sentence (or EDU) segmentation dataset
- Thai word frequency dataset (and its misspellings)
- OCR normalization (correct tone marks and upper vowels)
- Orchid corpus cleanup and standardization on the UD tagset
- Tokenize the whole Wisesight dataset
- Clean up the new TNC and use it to train a POS tagger
- Correct the PUD treebank to follow the Chula UD guidelines.
- Add more sentences to the PUD treebank
- Use the PUD treebank to train a dependency parser for Stanza (previously StanfordNLP)
- Collect and organize a YouTube Thai subtitle dataset
- Train an auto-correct model from the subtitle dataset, and provide an automatic Thai subtitle creation service.
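As a starting point for the first two items, here is a minimal sketch of tokenization plus spellcheck, assuming PyThaiNLP's `word_tokenize` and `spell.correct` APIs; the wrapper name `tokenize_and_spellcheck` is just illustrative.

```python
# Minimal sketch: tokenize Thai text, then suggest a correction per token.
# Assumes PyThaiNLP is installed (pip install pythainlp); the wrapper name
# tokenize_and_spellcheck is made up for this example.
from pythainlp.tokenize import word_tokenize
from pythainlp.spell import correct

def tokenize_and_spellcheck(text: str):
    """Return (token, suggested_correction) pairs for a Thai string."""
    tokens = word_tokenize(text, engine="newmm", keep_whitespace=False)
    return [(tok, correct(tok)) for tok in tokens]

if __name__ == "__main__":
    # Any pair where the token differs from the suggestion is a candidate
    # misspelling, which also feeds the misspelling-dataset idea above.
    for token, suggestion in tokenize_and_spellcheck("ผมเปนคนไทย"):
        print(token, "->", suggestion)
```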
If any of these interest you, I can give advice on how to get started on them.
Hi, I'm interested in working on/supporting the Text-to-speech issue, please let me know if anyone else is :)
Greetings, I'm currently looking into an auto-correct model for automated Thai subtitle creation, but I lack the dataset. I have about 200-500 sentences and a somewhat working subtitle segmentation service (S2T transcript -> human-readable sub). I think I can help gather the subtitle dataset. I will gladly be of assistance 😄.
For the YouTube subtitle dataset, here are the current resources and work in progress.
- A script that runs every hour, searching for new YouTube videos that might have a Thai subtitle. See thai_sub.gs
- The results of that script are saved to a BigQuery table, `kora-id.youtube.thaivdo` (publicly accessible).
- Another script runs occasionally to filter `thaivdo` somewhat and saves the result to a cleaner `kora-id.youtube.thaivdo2`.
These are human-made Thai subtitles. We can compare them to the Google Speech-to-Text output and train an auto-correct model. Here are some more scripts to make it easier (a small sketch of putting them together follows the list).
- Download subtitles from a video_id: youtube_transcript.py, download_subtitle.py
- Transcribe a YouTube video using Speech-to-Text: youtube_to_json.py
- Use BigQuery with Python: python_client.py
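For illustration, here is a rough sketch that ties these pieces together: pull video ids from the shared BigQuery table, then download each video's human-made Thai subtitle track. The column name `video_id` and the use of the `google-cloud-bigquery` and `youtube-transcript-api` packages are assumptions; the repository scripts above are the authoritative versions.

```python
# Rough sketch only: fetch candidate videos from the public BigQuery table,
# then download each video's human-made Thai subtitle track.
# Assumptions: the table has a video_id column, GOOGLE_APPLICATION_CREDENTIALS
# is configured, and the classic get_transcript() interface of
# youtube-transcript-api is available.
from google.cloud import bigquery
from youtube_transcript_api import YouTubeTranscriptApi

def thai_video_ids(limit: int = 10):
    """Query the shared table for a few candidate YouTube video ids."""
    client = bigquery.Client()
    query = f"SELECT video_id FROM `kora-id.youtube.thaivdo2` LIMIT {limit}"
    return [row.video_id for row in client.query(query).result()]

def thai_subtitle(video_id: str):
    """Download the Thai subtitle track (list of text/start/duration dicts)."""
    return YouTubeTranscriptApi.get_transcript(video_id, languages=["th"])

if __name__ == "__main__":
    for vid in thai_video_ids():
        for line in thai_subtitle(vid):
            print(vid, line["start"], line["text"])
```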
To-Do
- Collect statistics from `thaivdo2` (and probably `thaivdo`), so we know which YouTube channels are good sources of Thai subtitles.
- Collect video_id from all older videos of those channels.
- Alignment algorithms between the STT result and the human-made subtitles (a rough sketch follows).
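As a concrete starting point for the alignment item, here is a small sketch using Python's standard-library `difflib.SequenceMatcher` on already-tokenized text. Real subtitle alignment would also need the time codes, and the token lists below are toy data.

```python
# Toy sketch: align STT tokens against human-subtitle tokens with difflib.
# Both inputs are assumed to be pre-tokenized (e.g. with PyThaiNLP's
# word_tokenize); "replace" spans are natural training pairs for auto-correct.
from difflib import SequenceMatcher

def align(stt_tokens, sub_tokens):
    """Yield (op, stt_span, subtitle_span) triples describing the alignment."""
    matcher = SequenceMatcher(a=stt_tokens, b=sub_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        yield tag, stt_tokens[i1:i2], sub_tokens[j1:j2]

if __name__ == "__main__":
    stt = ["สวัสดี", "ครับ", "ทุก", "ท่าน"]   # toy STT output
    sub = ["สวัสดี", "ค่ะ", "ทุก", "ท่าน"]    # toy human subtitle
    for tag, a, b in align(stt, sub):
        print(tag, a, b)
```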
Let me reply in Thai. I once built text-to-speech with Tacotron using the TSynC-1 Corpus, but even after training for 1 million epochs the model could not synthesize full sentences at all. Someone else has tried Tacotron 2 with a dataset generated from Google text-to-speech, and the result sounds OK: https://link.medium.com/KJeFCYpck6
As for me, I built another text-to-speech system at https://github.com/PyThaiNLP/tts-thai using Google's tools together with the TSynC-1 Corpus, and the output is OK. However, I did not implement G2P, so it cannot synthesize words that are not in the dictionary.
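To show what the missing piece might look like, here is a hedged sketch of a dictionary-plus-G2P fallback for out-of-dictionary words. It assumes PyThaiNLP's `transliterate` with the `thaig2p` engine (which needs optional deep-learning dependencies); the `LEXICON` contents, the phone notation, and the `to_phones` name are all illustrative and not what tts-thai actually uses.

```python
# Hedged sketch of an out-of-dictionary fallback for a Thai TTS front end.
# Assumptions: PyThaiNLP's thaig2p transliteration engine is installed, and
# LEXICON / the phone strings are placeholders, not the tts-thai format.
from pythainlp.transliterate import transliterate

LEXICON = {"สวัสดี": "s a . w a t . d ii"}  # placeholder entry

def to_phones(word: str) -> str:
    """Return phones from the lexicon, falling back to G2P for unseen words."""
    if word in LEXICON:
        return LEXICON[word]
    return transliterate(word, engine="thaig2p")

if __name__ == "__main__":
    print(to_phones("สวัสดี"))   # found in the dictionary
    print(to_phones("ไพทอน"))    # out-of-dictionary, handled by G2P
```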