engram How to create an Engram layout for Polish?

Is it hard to create an Engram layout for a language like Polish? I created my layout based on Engram, but maybe it can be optimized further.

ukladpl ukladplAltGr

Sep 08 '22 10:09 AKmatiAK

I updated my layout, here it is:

FWYR2

The AltGr layer remains the same.

Sep 09 '22 21:09 AKmatiAK

Please take a look at the optimized layout for Spanish (https://github.com/binarybottle/engram-es) that we created.

If there is very accurate and representative 1-gram and bigram frequency data for the Polish language (including symbols), then we could apply a modified version of the code to generate an Engram layout optimized for Polish.
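As a rough illustration of what such frequency data looks like, here is a minimal Python sketch that counts character 1-grams and bigrams. It assumes n-grams are counted inside whitespace-delimited tokens (so bigrams never span a space) and that punctuation stays attached to its token. The function name `count_ngrams` and the sample sentence are illustrative only; this is not the code used for Engram.

```python
from collections import Counter


def count_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams inside whitespace-delimited tokens."""
    counts = Counter()
    for token in text.split():
        # Slide a window of width n over the token; symbols are kept.
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


# Illustrative sample only; a real run would read a large corpus file.
sample = "Masz ty duszę? Powiedz!"
unigrams = count_ngrams(sample, 1)
bigrams = count_ngrams(sample, 2)
print(unigrams.most_common(5))
print(bigrams.most_common(5))
```

Whether bigrams should also be counted across spaces is a design choice that depends on how the layout optimizer treats the space key.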

Sep 09 '22 23:09 binarybottle

Hi. I contacted the creators of the Polish corpus and got data up to 5-grams. It's available here: n-grams pl data. Do I need to process it further, or is it enough as it is?

Also, here is a list of available resources that might be useful: link

Sep 10 '22 19:09 AKmatiAK

This corpus looks pretty official! I like that it has a broad variety of book and news sources. Too bad it doesn't include spoken transcripts or social media sources. Anyway, I would be happy to help with this but it will be a couple of months before I can get to it -- buried with projects right now.

Sep 12 '22 15:09 binarybottle

OK, thank you for the help ;)

Sep 12 '22 20:09 AKmatiAK

@iandoug -- Given your experience helping to clean up the Spanish corpus, do you have any concerns about the proposed Polish corpus?: http://zil.ipipan.waw.pl/NKJPNGrams

Sep 30 '22 02:09 binarybottle

Hi Arno

For keyboard layout use, I prefer to strip texts not normally typed on a computer keyboard (like spoken transcripts or tweets) because they will mess up the character frequencies and n-grams.

"Each unigram is maximum continuous chunk of non-whitespace lower-case characters."

That is the normal way of doing it. Ian of course is not normal and does it Case Sensitive ... :-) Because typing Th is different to th.
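To make the difference concrete, here is a small Python sketch that counts the same bigram both ways. The sample sentence is made up for illustration; with case-sensitive counting, `Th` (which needs Shift) and `th` are tallied separately instead of being merged.

```python
from collections import Counter

# Made-up sample text, for illustration only.
sample = "The theory. Then the thing."


def bigram_counts(text: str) -> Counter:
    """Count every pair of adjacent characters in the text."""
    return Counter(a + b for a, b in zip(text, text[1:]))


case_folded = bigram_counts(sample.lower())   # "Th" and "th" merged
case_sensitive = bigram_counts(sample)        # "Th" and "th" kept apart

print("folded th:", case_folded["th"])
print("sensitive Th:", case_sensitive["Th"], "th:", case_sensitive["th"])
```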

It looks like they only have the "1-million-word subcorpus" available to download.

Is this typical Polish text?

Masz ty duszę? Powiedz! Tak jest. Od ręki. To chyba dobra formuła.

Do lines starting with a dash indicate dialogue? There seems to be a lot of that, which is going to bump up the dash frequency. Should maybe strip out leading dashes.
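A minimal Python sketch of that cleanup step, assuming dialogue lines begin with a hyphen, en dash, or em dash followed by a space. The sample lines (including the dashes in front of them) are an assumption for illustration, and `strip_leading_dashes` is a hypothetical helper, not part of any existing cleanup program.

```python
import re

# Hyphen-minus, en dash, or em dash at the start of a line,
# with optional spaces or tabs around it (never newlines).
LEADING_DASH = re.compile(r"^[ \t]*[-\u2013\u2014]+[ \t]*", flags=re.MULTILINE)


def strip_leading_dashes(text: str) -> str:
    """Remove dialogue dashes at line starts; dashes inside words are kept."""
    return LEADING_DASH.sub("", text)


# Assumed sample: two dialogue lines and one hyphenated word.
sample = "- Masz ty duszę? Powiedz!\n- Tak jest. Od ręki.\nbiało-czerwony"
cleaned = strip_leading_dashes(sample)
print(cleaned)
```

Only the line-initial dash is removed, so hyphenated words still contribute to the dash frequency.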

I have not seen any HTML etc, but since this is "manually annotated", I guess it is "clean" in that regard. There are some ALL CAPS sentences.

Will see if I can extract the text and do some analysis over the weekend. We are currently having rolling blackouts so that messes up plans.

Sep 30 '22 06:09 iandoug

For future reference: Leipzig

https://wortschatz.uni-leipzig.de/en/download/Polish#pol_newscrawl_2011

Sep 30 '22 07:09 iandoug

What's the difference in quotes? Should both be on keyboard?

która znalazła się w zestawieniu "Billboard Magazine".

” 2012, a zespół otrzymał nominację do nagród „Songlines Music Awards” 2012 w kategorii „Best Group”. ” (2012) oraz realizator dźwięku przy filmie „

Sep 30 '22 07:09 iandoug

> Do lines starting with a dash indicate dialogue? There seems to be a lot of that, which is going to bump up the dash frequency. Should maybe strip out leading dashes.

Yes, they mark dialogue, but they are rarely used outside of books.

> What's the difference in quotes? Should both be on keyboard?

Bottom quotes are rarely used and shouldn't be on the keyboard (they have now been superseded by using upper quotes on both sides, and the meaning is the same).

> Is this typical Polish text?
>
> Masz ty duszę? Powiedz! Tak jest. Od ręki. To chyba dobra formuła.

2 and 3 are typical; 1 is correct but reads more like something from a book.

Sep 30 '22 08:09 AKmatiAK

Thank you for taking a look, @iandoug! I don't know Polish, so I will defer to @AKmatiAK and other Polish speakers/typists. A corpus of only 1 million words is pretty small, but I hope it represents what people type.

Sep 30 '22 14:09 binarybottle

I took a look at the linked corpus, not wild about it, seems to contain a lot of dialogue. Will try cleaning up some of it as next step after this.

Instead, I grabbed all the 1M files from the Leipzig Polish corpus. After looking at those, decided to only use the "news" files, the rest is going to be a mess to clean. So that supplies 9 million sentences.

After tweaking my Spanish cleanup program, now have a 688 MB text file to play with. I grabbed some Polish books from Gutenberg ... only a few, most seem to be poetry or dialogue-heavy novels. Will try my usual "extract some text" approach with those to add to the Leipzig file.

Current char distribution looks like this. Provisional list, may change ...

char-dist-1.txt

polishfreq1.txt

Oct 01 '22 14:10 iandoug

Gonna have to do this on ISO not ANSI ... need that extra key. Even then, going to be challenging. :-)

112 characters?

Oct 01 '22 14:10 iandoug

@iandoug -- Using news files sounds reasonable, but I wouldn't throw out dialogues -- they are far closer to how people type emails than books are.

Oct 01 '22 16:10 binarybottle

I took a further look at the NKJP n-grams, and they're heavily bloated with transcripts of parliamentary sessions or something similar, so they're pretty useless. News/internet is the way to go. I'll take a look at the Leipzig files.

Oct 01 '22 16:10 AKmatiAK

Sample from "Web" corpus attached.

Will do your "single-case" frequencies and bigrams in due course.

web-sample.txt

The dialogues all look like this:

% short sentence 1. % short sentence 2. % short sentence 3.

where % is the - character. Markdown getting in the way again.

Oct 01 '22 16:10 iandoug

I don't know how to read n-grams from the Leipzig files. Are there any instructions for this?

> Gonna have to do this on ISO not ANSI ... need that extra key. Even then, going to be challenging. :-)
>
> 112 characters?

~~We use ISO keyboard, same as in US here.~~ Both ISO and ANSI are used. Polish characters are on AltGr. 112 characters, not counting space and enter.

Oct 01 '22 16:10 AKmatiAK

First attempt at bigrams. Am playing with trial layout, I see from Wikipedia that people actually prefer the ANSI boards over ISO so am using that.

Also UDHR in Polish as temporary test file. The UN no longer seems to have .txt downloads, just PDF or on a web page.

udhr-polish.txt bigrams-polish1.csv

csv is tab-separated.
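For anyone loading a tab-separated file like this, a minimal Python sketch follows. The column layout (bigram, then count, with no header row) is an assumption, and the counts in `demo` are made up; check the real file before relying on it.

```python
from __future__ import annotations

import csv
import io

# Assumed layout: bigram<TAB>count, one pair per row, no header.
# The counts are invented for illustration.
demo = "ie\t120\nni\t95\nna\t90\n"


def read_bigram_counts(handle) -> dict[str, int]:
    """Parse tab-separated 'bigram, count' rows into a dictionary."""
    # QUOTE_NONE keeps quote characters as ordinary data, which matters
    # when a bigram itself contains a double quote.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: int(row[1]) for row in reader if len(row) >= 2}


counts = read_bigram_counts(io.StringIO(demo))
ranked = sorted(counts, key=counts.get, reverse=True)
print(" ".join(ranked))
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.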

Most common: ie ni na ow st ze cz rz po ch an ra pr wi zy ro ia za wa ta dz sz od ki en ko ar ej mi li ci zi ac

Oct 01 '22 16:10 iandoug

> I see from Wikipedia that people actually prefer the ANSI boards over ISO so am using that.

Honestly, I bought one ANSI keyboard and don't like it, but it shouldn't be a problem to make two variants of the layout, should it? I checked which layout the keyboards currently sold in Poland have, and no single standard is followed: both ISO and ANSI are used (unlike what I wrote previously; I thought ISO was the standard here).

Oct 01 '22 17:10 AKmatiAK

> > I see from Wikipedia that people actually prefer the ANSI boards over ISO so am using that.
>
> Honestly, I bought one ANSI keyboard and don't like it, but it shouldn't be a problem to make two variants of the layout, should it? I checked which layout the keyboards currently sold in Poland have, and no single standard is followed: both ISO and ANSI are used (unlike what I wrote previously; I thought ISO was the standard here).

Yeah, I was surprised at what Wikipedia said about that. At the moment I have enough keys on ANSI, though must put Euro somewhere. Spanish and French more tricky because multiple diacritics per vowel. You only have 2 on Z.

Here's first attempt at chained bigrams, since the UDHR character frequency is not very good. But not happy with this file either, has too many digits. Which is a consequence of the "news" input I suppose.

polishmonkeytest.txt

Oct 01 '22 17:10 iandoug

Okay finally got somewhere but it "feels" a mess, probably because I know nothing about Polish. But it will give you something to compare against.

Ignore the layouts with .en. in the name, they are missing the Polish letters so their scores are wrong.

The bottom one is the "Programmer" layout which WP says is the most common. It might make sense to put some of these letters on their own keys, instead of Q V X since these 3 are not native Polish and thus rare. Or at least ł.

ł ord 197 hex c5 8602003
ż ord 197 hex c5 4490722
ó ord 195 hex c3 4461711
ś ord 197 hex c5 2988511
ć ord 196 hex c4 2271532
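The `ord` and `hex` values in that listing match the first byte of each letter's UTF-8 encoding rather than its Unicode code point, which would explain why ł, ż and ś all show 197 / c5. A short Python check, offered as an illustration and not as the script that produced the listing:

```python
letters = "łżóść"

for ch in letters:
    encoded = ch.encode("utf-8")   # each of these letters is two bytes in UTF-8
    lead = encoded[0]              # first (lead) byte
    print(ch, "code point", hex(ord(ch)),
          "utf-8", encoded.hex(),
          "lead byte", lead, format(lead, "x"))

# Lead byte per letter, for comparison with the listing.
lead_bytes = {ch: ch.encode("utf-8")[0] for ch in letters}
```

Reading the input as decoded characters instead of raw bytes would give a distinct code point for each letter.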

polish-test-1 ian pl ansi

Oct 01 '22 19:10 iandoug

Enough for today. Getting better.

polish-test-2 ian2 pl ansi

Oct 01 '22 19:10 iandoug

Been playing around. Current best version, changes may be too dramatic for easy acceptance.

Can compare performance against default Programmer version at bottom of list. Ignore .en. layouts.

polish-test-3 ian8 pl ansi

Oct 02 '22 15:10 iandoug

Think I need the diacritic S letters on separate key, which means switching to ISO form factor.

Oct 02 '22 15:10 iandoug

Hand balance is 58:42, but can't find spot on right for popular letters on left ...
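For reference, a minimal Python sketch of how a hand-balance figure like this can be computed from a text sample. The left/right split below is simply the QWERTY halves, used as a hypothetical stand-in; it is not any layout from this thread, and characters assigned to neither hand (for example AltGr letters) are ignored.

```python
from __future__ import annotations


def hand_balance(text: str, left_keys: str, right_keys: str) -> tuple[float, float]:
    """Return (left %, right %) of keystrokes; unassigned characters are skipped."""
    lowered = text.lower()
    left = sum(1 for ch in lowered if ch in left_keys)
    right = sum(1 for ch in lowered if ch in right_keys)
    total = left + right
    if total == 0:
        return (0.0, 0.0)
    return (100 * left / total, 100 * right / total)


# Hypothetical split for illustration only: the two halves of QWERTY.
LEFT = "qwertasdfgzxcvb"
RIGHT = "yuiophjklnm"

left_pct, right_pct = hand_balance("To chyba dobra formuła.", LEFT, RIGHT)
print(f"{left_pct:.0f}:{right_pct:.0f}")
```

A real measurement would use corpus-wide character frequencies rather than one sentence.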

Oct 02 '22 15:10 iandoug

This is my current layout, which I have been building for about a month by simple intuition, applying fixes based on what I thought should change, so it might be useful to some extent in designing engram-pl. I know it lacks some keys, because I changed it frequently. In my subjective opinion, the cie trigram is very frequent and should be placed on the keyboard (but I may be wrong). Also, mixing different letters on one key is not a very good idea imo; it might be faster but it is unintuitive. Only ź should be placed on another letter. Placing ł on i instead of L is also reasonable, because I found it easy to remember somehow.

Btw, what I like a lot about ISO is the far better thumb access to AltGr and one more letter on the home row. I couldn't achieve that on ANSI, and because of that I stuck with my old ISO one.

keyboard-layout(2)

of course I have caps and ctrl swapped ;)

Oct 02 '22 17:10 AKmatiAK

Mmm so of course you would use a form factor that is not in KLA ..... neither ANSI nor ISO :-)

sz is a common bigram so should not be on same finger.
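A small Python sketch of how such clashes can be flagged automatically. The finger map and the counts are hypothetical, made up for illustration: the map deliberately puts s and z on the same finger to show the case described above, and it does not reproduce any layout posted in this thread.

```python
from __future__ import annotations


def same_finger_bigrams(bigram_counts: dict[str, int],
                        finger_of: dict[str, str]) -> dict[str, int]:
    """Return bigrams whose two different letters share one finger."""
    hits = {}
    for bigram, count in bigram_counts.items():
        if len(bigram) != 2:
            continue
        a, b = bigram
        # Doubled letters are skipped; unmapped letters never match.
        if a != b and a in finger_of and finger_of[a] == finger_of.get(b):
            hits[bigram] = count
    return hits


# Hypothetical finger assignment and invented counts.
FINGERS = {"s": "L-ring", "z": "L-ring", "i": "R-middle",
           "e": "L-middle", "n": "R-index"}
COUNTS = {"sz": 50, "ie": 120, "ni": 95}

clashes = same_finger_bigrams(COUNTS, FINGERS)
print(clashes)
```

Sorting the result by count shows which same-finger pairs cost the most in practice.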

Q V X are not in your alphabet so it makes no sense to waste whole keys on them. They are only there because of QWERTY.

I made an ISO version, realised I had the spacebar on the wrong thumb, so had to basically mirror the layout to fix it.

Hand balance is nearly perfect now. ANSI version slightly better, but ISO puts the space bars further away and there's nothing I can do about that. Other metrics are better.

ian10 pl iso ian10 pl ansi polish-test-4

I may have used the wrong input file to create the chained bigrams, so redid it.

polishmonkeytest2.txt

Oct 02 '22 20:10 iandoug

The Q X V can be put in better places ... first get the Polish to work :-)

Oct 02 '22 20:10 iandoug

@iandoug -- Thank you for hitting this hard over the weekend! I am slammed this week but hope to take a look at what you're doing next weekend.

Oct 03 '22 01:10 binarybottle

Was not intending to but once you start fiddling with layouts ... like a drug :-)

Also have other stuff to do this week, will try to improve corpus when I have time.

Oct 03 '22 06:10 iandoug