Ivan Belonogov
Hi! This behaviour isn't related to BPE-dropout. If the characters `'▁'` and `'h'` did not merge, then they did not occur together often enough. This means that the algorithm, instead of...
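To illustrate why an infrequent pair is unlikely to be merged, here is a minimal sketch of the greedy pair-counting step at the core of BPE. This is plain Python, not YTTM's actual implementation, and the toy corpus is made up:

```python
from collections import Counter

# Toy corpus already split into character-level symbols,
# with '▁' marking the beginning of a word.
words = [["▁", "h", "e", "l", "l", "o"],
         ["▁", "w", "o", "r", "l", "d"],
         ["▁", "h", "e", "l", "p"]]

# Count how often each adjacent pair of symbols occurs.
pair_counts = Counter()
for word in words:
    for left, right in zip(word, word[1:]):
        pair_counts[(left, right)] += 1

# BPE greedily merges the most frequent pair first, so a pair like
# ('▁', 'h') is only merged if it is frequent enough to win one of
# the limited number of merge rounds allowed by the vocabulary size.
print(pair_counts.most_common(3))
```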
Yes, I also meant this special token '▁'. (Edited the previous comment)
It is not obvious to me. In practice, for a reasonably large vocabulary, the special token '▁' is almost always merged with the first symbol.
YTTM is very similar to subword-nmt. The difference is the following:
- In YTTM you can specify the exact number of tokens in the output vocabulary.
- In subword-nmt you...
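For reference, here is a minimal YTTM usage sketch where the output vocabulary size is specified exactly; the file names are placeholders:

```python
import youtokentome as yttm

# Train a BPE model; the resulting vocabulary contains
# exactly vocab_size tokens.
yttm.BPE.train(data="train.txt", model="model.bpe", vocab_size=5000)

# Load the model and encode a sentence into token ids.
bpe = yttm.BPE(model="model.bpe")
print(bpe.encode(["hello world"], output_type=yttm.OutputType.ID))
```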
Hi, @TIXFeniks. Your suggestion looks reasonable, but I don't want to add one more option for disabling this type of split. Every new option makes the interface more cumbersome and decreases...
How much RAM do you have?
Yes, that looks reasonable.
Is it a common situation for datasets to have a significant percentage of non-printable characters? By significant I mean more than 0.1%. Otherwise they can easily be filtered out with the `coverage` option.
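As an illustration, the `coverage` argument of `yttm.BPE.train` can be lowered slightly so that the rarest characters (e.g. non-printable noise) are not given their own tokens. The paths and numbers below are placeholders:

```python
import youtokentome as yttm

# coverage < 1.0 means training only covers that fraction of character
# occurrences by frequency; the rarest characters (often noise such as
# non-printable ones) are treated as unknown instead of getting
# dedicated tokens.
yttm.BPE.train(
    data="train.txt",
    model="model.bpe",
    vocab_size=5000,
    coverage=0.999,  # ignore the rarest ~0.1% of character occurrences
)
```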
We don't support this feature right now. Maybe it will be added later. You can use the following workaround: just append this special token at the end of your training...
No, there is no easy way to do it. If the training data is so large that it does not fit into memory, then most likely you can subsample random...
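A minimal sketch of one way to subsample random lines from a training file that is too large for memory, using reservoir sampling; the file names and sample size are assumptions, not part of YTTM:

```python
import random

def sample_lines(path, k, seed=0):
    """Pick k random lines from a file without loading it all into memory."""
    rng = random.Random(seed)
    sample = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)
            else:
                # Reservoir sampling: keep each later line with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = line
    return sample

with open("train_subsample.txt", "w", encoding="utf-8") as out:
    out.writelines(sample_lines("full_corpus.txt", k=1_000_000))
```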