After tokenizing with a trained tokenizer, the "tokens" array contains only the original tokens
After tokenizing a song with a trained tokenizer, the "tokens" array contains only the base tokens, while the "ids" array is fine and contains the newly generated vocabulary. I was wondering whether this is a design choice or a bug.
Hi, this is a design choice (i.e. only the ids are altered), as the main purpose of encoding the sequence is to feed the ids to a model. If you really need to explore what the encoded ids are made of, you can always use the vocabulary dictionaries to convert the encoded ids back to tokens: https://github.com/Natooz/MidiTok/blob/main/miditok/midi_tokenizer.py#L111
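For anyone landing here later, a minimal sketch of that idea in plain Python. It does not use MidiTok's actual API: `base_vocab`, `learned`, `expand` and `decode_ids` are hypothetical names made up for illustration, and the token strings and ids are invented. It only assumes that each learned id stands for a sequence of base ids and that an inverted vocabulary dictionary maps base ids back to token strings.

```python
# Hypothetical base vocabulary (token string -> id); not MidiTok's real vocabulary.
base_vocab = {"Pitch_60": 0, "Velocity_100": 1, "Duration_1.0": 2, "Pitch_64": 3}

# Hypothetical learned tokens: each new id stands for a sequence of
# base ids or previously learned ids (assumption for this sketch).
learned = {4: [0, 1], 5: [4, 2]}

# Invert the vocabulary so ids can be looked up as token strings.
id_to_token = {i: t for t, i in base_vocab.items()}


def expand(token_id):
    """Recursively expand a learned id into the base ids it is made of."""
    if token_id in learned:
        out = []
        for part in learned[token_id]:
            out.extend(expand(part))
        return out
    return [token_id]


def decode_ids(ids):
    """Convert encoded ids back to base token strings."""
    return [id_to_token[base_id] for i in ids for base_id in expand(i)]


encoded_ids = [5, 3]
print(decode_ids(encoded_ids))
```

With a real tokenizer you would replace the two hypothetical dictionaries with the vocabulary dictionaries the tokenizer exposes; the lookup logic stays the same.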
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.