
fix mask_user_labels

DoffeBupt opened this issue · 3 comments

Dear authors of StarCoder, while using the framework I noticed that mask_user_labels sometimes does not work as intended. After looking into it, I believe the problem lies in how the function is invoked. Here are my modifications:

  1. In the group_texts() function in train.py, result["input_ids"] is a list of lists (one inner list per block), while mask_user_labels expects a single flat list, so the masking did not behave as expected. I loop over the blocks and call the function on each labels list separately, which lets it run normally.
  2. In the mask_user_labels function, I added masking for system-related labels. In addition, the current_idx < len(labels) check in the while loop should come first; with it placed second, labels[current_idx] is evaluated before the bounds check, so the check never prevents anything and the indexing can go out of bounds (a sketch of the reworked function is below).
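
Roughly, the reworked function could look like the sketch below. This is illustrative only: the dialogue_template attribute names (user_token, assistant_token, system_token) and the -100 ignore index are assumptions, so adapt them to the actual code.

def mask_user_labels(tokenizer, dialogue_template, labels):
    """Mask user and system turns of one flat list of token ids, in place."""
    user_token_id = tokenizer.convert_tokens_to_ids(dialogue_template.user_token)
    assistant_token_id = tokenizer.convert_tokens_to_ids(dialogue_template.assistant_token)
    # added: also mask the system turns (assumed attribute name)
    system_token_id = tokenizer.convert_tokens_to_ids(dialogue_template.system_token)
    for idx, token_id in enumerate(labels):
        if token_id in (user_token_id, system_token_id):
            current_idx = idx
            # bounds check first, so labels[current_idx] is never read out of range
            while current_idx < len(labels) and labels[current_idx] != assistant_token_id:
                labels[current_idx] = -100  # ignored by the loss
                current_idx += 1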

DoffeBupt · Aug 27, 2023

def group_texts(examples):
    # chain, block_size, tokenizer, dialogue_template and mask_user_labels
    # all come from the enclosing scope of the training script.
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding instead if the model
    # supported it. Customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    labels = concatenated_examples["input_ids"].copy()
    mask_user_labels(tokenizer, dialogue_template, labels)
    concatenated_examples['labels'] = labels
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result

Can we change this?

wanglongxingtianxia · Aug 28, 2023

(quoting the group_texts snippet and question from the previous comment)

Change mask_user_labels(tokenizer, dialogue_template, labels) to for label in labels: mask_user_labels(tokenizer, dialogue_template, label).

I think this should make sense
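
Spelled out, the proposed call would be the loop below. Note that this only makes sense when labels is a list of block-sized lists (e.g. built from result["input_ids"] after the split, as in point 1 of the issue); the names follow the snippet above.

for label in labels:
    # each label is one block's flat list of token ids
    mask_user_labels(tokenizer, dialogue_template, label)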

DoffeBupt · Aug 28, 2023

concatenated_examples['input_ids'] is a one-dimensional list.
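
A toy example with made-up token ids shows why: chain(*...) flattens the per-example lists before anything is split, so at the point where mask_user_labels is called in the snippet above the labels are still flat; the nesting only appears afterwards in result.

from itertools import chain

examples = {"input_ids": [[1, 2, 3], [4, 5, 6, 7]]}
concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
print(concatenated["input_ids"])  # [1, 2, 3, 4, 5, 6, 7] -- flat, one-dimensional

block_size = 4
total_length = len(concatenated["input_ids"]) // block_size * block_size
result = {
    k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
    for k, t in concatenated.items()
}
print(result["input_ids"])  # [[1, 2, 3, 4]] -- nesting only appears after the split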

wanglongxingtianxia · Aug 29, 2023