
RuntimeError: Could not find response key token IDs when using bloom model and tokenizer to train

Open • acul3 opened this issue 2 years ago • 3 comments

It works fine if I use gpt-j.

I guess this is because of the tokenizer and this line: https://github.com/databrickslabs/dolly/blob/03bf3852daa42e6091a39483dda0714c02de7382/training/trainer.py#L52

Any tips on adjusting it so it can use models other than gpt-j?
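For context, the linked check does something like this (an illustrative sketch, not the exact dolly code): encode the response marker once, then scan each example's labels for that exact token-ID subsequence, raising the error above if no match is found.

```python
import numpy as np

# Illustrative sketch of the check around trainer.py#L52, not the exact
# dolly code. The marker string is an assumption; in practice
# response_token_ids = tokenizer.encode(RESPONSE_KEY).
RESPONSE_KEY = "### Response:\n"

def find_response_start(labels: np.ndarray, response_token_ids: list[int]) -> int:
    n = len(response_token_ids)
    # Try every position where the first marker token occurs.
    for idx in np.where(labels == response_token_ids[0])[0]:
        if labels[idx : idx + n].tolist() == response_token_ids:
            return int(idx)
    # A tokenizer that splits the marker differently in context ends up here.
    raise RuntimeError("Could not find response key token IDs")
```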

Thanks!

acul3 commented Mar 25 '23 12:03

Yes, the code as currently written probably depends on the way this particular tokenizer works. I suggest printing out response_token_ids and batch["labels"][i] to help identify why it isn't finding the sequence. You can separately run tokenizer = load_tokenizer() and then run tokenizer.decode on parts of the sequence to debug, e.g. along the lines sketched below. It would actually be helpful if the error message included this information.
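For example, something like this (using a public bloom checkpoint as a stand-in for whatever model load_tokenizer is configured with; the sample record is made up):

```python
from transformers import AutoTokenizer

# Assumption: swap in whichever model you are actually training with.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

response_token_ids = tokenizer.encode("### Response:\n")
print("marker alone:", response_token_ids)

# Tokenize the marker as it appears inside a training record, then decode
# token by token to see how the tokenizer split the text in context.
sample = "### Instruction:\nSay hello.\n\n### Response:\nHello!"
sample_ids = tokenizer.encode(sample)
print("in context:", sample_ids)
print([tokenizer.decode([i]) for i in sample_ids])
```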

matthayes commented Mar 25 '23 15:03

I think we can also make the code more robust so that it works with other tokenizers. I'll look into it. Can you share the repro steps? For example, which model are you using?

matthayes commented Mar 25 '23 15:03

@matthayes

I'm just changing this line to the model that I want: https://github.com/databrickslabs/dolly/blob/03bf3852daa42e6091a39483dda0714c02de7382/training/trainer.py#L35

So far I've tried bloom and xglm, and got the same error.

acul3 commented Mar 26 '23 07:03

I hit this too, but I'm not sure if it was for the same reason. I have different input, and I didn't format it exactly like the alpaca dataset. In particular, the input has to include ### Response\n, for example (roughly the format sketched below).
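For reference, an alpaca-style record looks roughly like this (the preamble wording here is approximate; the ### markers are what the collator keys on):

```
Below is an instruction that describes a task. Write a response that
appropriately completes the request.

### Instruction:
Summarize the paragraph below.

### Response:
<the target completion goes here>
```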

srowen commented Mar 28 '23 19:03

I believe this is resolved in Matt's changes from a few days ago anyway.

srowen commented Mar 30 '23 21:03

I just tested this with bloom using recent code and was still able to reproduce the error.

matthayes commented Mar 30 '23 21:03

I see what is causing the issue. For the bloom model, the tokenizer combines the newline at the end of ### Response:\n with the character that follows it, producing a different token. This doesn't happen with the gpt-j tokenizer. As a result, the token IDs for ### Response:\n are never found as an exact subsequence.
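You can see this by decoding token by token (model names below are the public checkpoints; "Hello" is just a stand-in for the start of a response). If the bloom tokenizer merges the newline into the next character, the per-token decode shows a piece like "\nH", so the standalone IDs for ### Response:\n never match:

```python
from transformers import AutoTokenizer

# Compare how each tokenizer splits the marker when text follows it.
text = "### Response:\nHello"
for name in ("EleutherAI/gpt-j-6B", "bigscience/bloom-560m"):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text)
    print(name, [tok.decode([i]) for i in ids])
```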

matthayes commented Mar 30 '23 21:03

I actually just merged a change today that I think will make this easier to fix. Currently ### Response: becomes a single token, since I made it a special token. I think I can update this so that the single token is ### Response:\n instead. That should prevent the newline from being combined with what follows.
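A sketch of that approach, using the standard transformers calls (the checkpoint name is just an example, not the exact dolly code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Registering the full marker, newline included, as a special token means it
# always encodes to exactly one ID, so the tokenizer can no longer merge the
# "\n" into whatever follows it.
RESPONSE_KEY_NL = "### Response:\n"

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

tokenizer.add_special_tokens({"additional_special_tokens": [RESPONSE_KEY_NL]})
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.encode(RESPONSE_KEY_NL + "Hello"))  # marker is now one ID
```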

matthayes commented Mar 30 '23 21:03

I've merged the fix, which enables training with bloom.

matthayes commented Mar 30 '23 23:03