Open-Assistant

Add language variations and more languages

Open Felizolinha opened this issue 2 years ago • 4 comments

While going through questions/answers in Brazilian Portuguese (pt-BR), I noticed some of them were in European Portuguese (pt-PT). There are significant differences between these two variants, and grouping them under the same classification could negatively impact the performance of the assistant in Portuguese.

I'm not entirely sure how much regional variants differ in other languages, but I suspect the differences are also significant enough that they should have separate classifications in the dataset.

Also, there are many languages not included in the options. I don't see a good reason to limit the number of languages in the dataset, even if few people speak them. It's a great opportunity to collect even more data and make the model more capable of translating and understanding other languages.

If the UI hasn't been translated into a language yet, it's probably safe to display the UI in English until someone translates it.
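As a rough sketch of that fallback, assuming locales are identified by tags like `pt-BR`: the function name `resolve_ui_locale` and the list of translated locales are hypothetical and not taken from the Open-Assistant codebase. It tries the exact tag, then the bare language, then English.

```python
from typing import Iterable


def resolve_ui_locale(requested: str, translated: Iterable[str], default: str = "en") -> str:
    """Pick a UI locale for a requested tag such as "pt-BR".

    `translated` is the (hypothetical) set of locales that have a UI translation.
    """
    # Compare case-insensitively, but return the tag as it was registered.
    available = {tag.lower(): tag for tag in translated}
    parts = requested.replace("_", "-").split("-")
    # Try the full tag first, then progressively shorter prefixes: pt-BR -> pt.
    while parts:
        candidate = "-".join(parts).lower()
        if candidate in available:
            return available[candidate]
        parts.pop()
    # Nothing matched, so show the UI in the default language.
    return default


if __name__ == "__main__":
    ui_locales = ["en", "de", "pt-BR"]
    for tag in ("pt-BR", "pt-PT", "de-AT", "gl"):
        print(tag, "->", resolve_ui_locale(tag, ui_locales))
```

With this approach only the UI falls back. The language label stored with each message could still be the tag the contributor selected (for example `pt-PT`), so the variants stay separate in the dataset.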

One problem that might arise is weaker moderation in less common languages. It would also be useful to have some kind of quality score for each language's dataset, since with less moderation it can be more prone to spam attacks.

If there are no technical reasons why more languages are not included, I'd happily work on including them, although I might need a month or so before I can start working on this.

Here's a (probably not exhaustive) list of languages with their country codes, which could be a starting point for the languages to include: http://www.lingoes.net/en/translator/langcode.htm

Felizolinha avatar Feb 20 '23 20:02 Felizolinha

Related: #1449

jentrialgo avatar Feb 20 '23 21:02 jentrialgo

Unless there is a solid base of users, much of the effort would be wasted, as prompts need to be labelled, ranked, replied to, etc. by users other than the one submitting the prompt. This is why we have generally followed the model of supporting any language where someone is willing to submit a website translation PR, as this demonstrates sufficient community interest in the language. We also have to consider moderation: if there are many languages which our team does not understand, it can introduce opportunities for spam.

olliestanley avatar Feb 20 '23 22:02 olliestanley

Current stats: image

see https://open-assistant.io/stats

andreaskoepf avatar Feb 22 '23 12:02 andreaskoepf

These numbers are pretty small for most languages... are there any coordinated marketing efforts happening to raise awareness about OpenAssistant?

Felizolinha avatar Feb 22 '23 14:02 Felizolinha

Closing this in favour of specific issues for individual languages

olliestanley avatar Jun 12 '23 18:06 olliestanley