Hands-On-Large-Language-Models

Chapter 10, page 306

xyang2013 opened this issue · 3 comments

Should soft_negatives be defined as follows instead?

```python
import random

from datasets import Dataset
from tqdm import tqdm


def deranged_shuffle(original):
    """Shuffle until no element remains in its original position (a derangement)."""
    while True:
        shuffled = original.copy()
        random.shuffle(shuffled)
        if all(o != s for o, s in zip(original, shuffled)):
            return shuffled


# Keep only the entailment pairs (label == 0)
mnli = mnli.filter(lambda x: x["label"] == 0)

# Prepare data and add a soft negative
train_dataset = {"anchor": [], "positive": [], "negative": []}
# soft_negatives = mnli["hypothesis"]
# random.shuffle(soft_negatives)
soft_negatives = deranged_shuffle(mnli["hypothesis"])
for row, soft_negative in tqdm(zip(mnli, soft_negatives)):
    train_dataset["anchor"].append(row["premise"])
    train_dataset["positive"].append(row["hypothesis"])
    train_dataset["negative"].append(soft_negative)
train_dataset = Dataset.from_dict(train_dataset)
```

xyang2013 · Apr 20 '25

Could you perhaps share what the difference is and why you believe it should be changed? Also, note that the code you shared is difficult to read. A tip is to use ``` brackets so that the code keeps its structure.

MaartenGr · Apr 22 '25

Sorry for the format.

I was thinking that with random.shuffle, some hypotheses could remain in their original positions after shuffling, in which case the negative would be identical to the positive.
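A minimal sketch of that concern (not from the book): on a tiny list, random.shuffle frequently returns the identity permutation, so the drawn "negative" would equal the "positive".

```python
import random

# With a 2-element list, random.shuffle leaves the list in its
# original order about half the time, so the sampled negative
# would coincide with the positive in roughly half the draws.
random.seed(0)
trials = 10_000
hits = 0
for _ in range(trials):
    pair = [0, 1]
    random.shuffle(pair)
    if pair == [0, 1]:
        hits += 1
print(hits / trials)  # close to 0.5
```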

xyang2013 · Apr 22 '25

It indeed theoretically could happen. I'm wondering, given the amount of data, how often that would actually occur and what the impact on the results would be. It would be a nice experiment ;)

With 50k examples, I can imagine that happens quite seldom and in that case, using random.shuffle should suffice.
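As a sanity check of that intuition (a quick simulation, not part of the book's code): for a uniform random permutation the expected number of fixed points is exactly 1 regardless of list length, so while a collision somewhere in the data is actually quite likely (probability about 1 − 1/e ≈ 63%), with 50k rows only ~0.002% of the negatives would coincide with their positives on average.

```python
import random

def avg_fixed_points(n, trials=200):
    """Average count of positions random.shuffle leaves unchanged."""
    identity = list(range(n))
    total = 0
    for _ in range(trials):
        shuffled = identity.copy()
        random.shuffle(shuffled)
        total += sum(o == s for o, s in zip(identity, shuffled))
    return total / trials

# The average hovers around 1 whatever n is, so the fraction of
# affected rows shrinks as 1/n: negligible at 50k examples.
print(avg_fixed_points(50_000, trials=50))
```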

MaartenGr · Apr 30 '25