starcoder
fix mask_user_labels
Dear StarCoder authors, while using the framework I noticed that mask_user_labels sometimes does not work as expected. After investigating, I found what appears to be an issue with how the function is invoked. Here are my modifications:
- In the group_texts() function in train.py, result["input_ids"] is a list of lists, but mask_user_labels operates on a single flat list, so the masking did not take effect. I now loop over the labels and call the function on each one separately.
- In the mask_user_labels function, I added masking for system-related labels. In addition, the condition current_idx < len(labels) in the while loop should come first in the loop condition; otherwise the check happens too late to be useful, and labels[current_idx] can be accessed out of bounds.
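To illustrate the ordering point in the second bullet, here is a minimal sketch. It is not the actual mask_user_labels code: mask_until, stop_token_id, and the IGNORE_INDEX value of -100 are hypothetical names and values chosen for the example. It relies only on the fact that Python's `and` short-circuits, so the index is checked before it is used.

```python
IGNORE_INDEX = -100  # assumption: the ignore value commonly used for loss masking


def mask_until(labels, start, stop_token_id):
    """Mask labels in place from `start` until `stop_token_id` or the end of the list.

    Hypothetical helper for illustration only.
    """
    current_idx = start
    # The bounds check comes first: `and` short-circuits, so
    # labels[current_idx] is evaluated only for a valid index.
    while current_idx < len(labels) and labels[current_idx] != stop_token_id:
        labels[current_idx] = IGNORE_INDEX
        current_idx += 1
    return current_idx


labels = [5, 6, 7, 8]
# Token 99 never appears, so the loop has to stop at the end of the list.
end = mask_until(labels, 1, stop_token_id=99)
print(labels, end)
```

With the two conditions swapped, the same call would evaluate labels[4] on the last iteration and raise an IndexError before the length check is reached.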
def group_texts(examples):
    # Concatenate all texts.
    # print(type(examples))
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    labels = concatenated_examples["input_ids"].copy()
    mask_user_labels(tokenizer, dialogue_template, labels)
    concatenated_examples['labels'] = labels
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result
Can we change this?
Change mask_user_labels(tokenizer, dialogue_template, labels) to a loop over the labels: for label in labels: mask_user_labels(tokenizer, dialogue_template, label)
I think this should make sense.
concatenated_examples['input_ids'] is a one-dimensional list.
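The shapes under discussion can be checked with a small standalone sketch. The token IDs and the block_size of 4 below are made up for illustration; only the concatenation and chunking expressions are taken from the group_texts code quoted in this thread.

```python
from itertools import chain

block_size = 4  # hypothetical small value, for illustration only
examples = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}  # made-up token IDs

# Same concatenation as in group_texts: the per-example lists are flattened.
concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
flat = concatenated["input_ids"]

# Same chunking as in group_texts: the remainder is dropped, then the list is split.
total_length = (len(flat) // block_size) * block_size
chunks = [flat[i : i + block_size] for i in range(0, total_length, block_size)]

flat_is_1d = all(isinstance(t, int) for t in flat)
chunks_are_2d = all(isinstance(row, list) for row in chunks)
print(flat)
print(chunks)
```

In this sketch the concatenated list is flat and only the chunked result is nested, so a loop such as for label in labels would iterate over individual integers if it ran before chunking and over blocks if it ran after. Which form mask_user_labels expects depends on its implementation, which is not shown in this thread.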