Hands-On-Large-Language-Models

Chapter 10, page 306

xyang2013 opened this issue · 3 comments

Should soft_negatives be defined as follows instead?

```python
import random

from datasets import Dataset
from tqdm import tqdm


def deranged_shuffle(original):
    """Shuffle until no element remains in its original position (a derangement)."""
    while True:
        shuffled = original.copy()
        random.shuffle(shuffled)
        if all(o != s for o, s in zip(original, shuffled)):
            return shuffled


# Keep only the entailment pairs (label == 0)
mnli = mnli.filter(lambda x: x["label"] == 0)

# Prepare data and add a soft negative
train_dataset = {"anchor": [], "positive": [], "negative": []}
# soft_negatives = mnli["hypothesis"]
# random.shuffle(soft_negatives)
soft_negatives = deranged_shuffle(mnli["hypothesis"])
for row, soft_negative in tqdm(zip(mnli, soft_negatives)):
    train_dataset["anchor"].append(row["premise"])
    train_dataset["positive"].append(row["hypothesis"])
    train_dataset["negative"].append(soft_negative)
train_dataset = Dataset.from_dict(train_dataset)
```

xyang2013 · Apr 20 '25

Could you perhaps share what the difference is and why you believe it should be changed? Also, note that the code you shared is difficult to read. A tip is to use ``` brackets so that the code keeps its structure.

MaartenGr · Apr 22 '25

Sorry for the format.

I was thinking that with random.shuffle, some hypotheses could remain in their original positions after shuffling, in which case the negative would be identical to the positive.
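A minimal sketch of that concern (not from the book): on a tiny list, random.shuffle frequently returns the identity permutation, so the drawn "negative" would equal the "positive".

```python
import random

# With a 2-element list, random.shuffle leaves the list in its
# original order about half the time, so the sampled negative
# would coincide with the positive in roughly half the draws.
random.seed(0)
trials = 10_000
hits = 0
for _ in range(trials):
    pair = [0, 1]
    random.shuffle(pair)
    if pair == [0, 1]:
        hits += 1
print(hits / trials)  # close to 0.5
```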

xyang2013 · Apr 22 '25

It indeed theoretically could happen. I'm wondering, given the amount of data, how often that would actually occur and what the impact on the results would be. It would be a nice experiment ;)

With 50k examples, I can imagine that happens quite seldom and in that case, using random.shuffle should suffice.
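As a sanity check of that intuition (a quick simulation, not part of the book's code): for a uniform random permutation the expected number of fixed points is exactly 1 regardless of list length, so while a collision somewhere in the data is actually quite likely (probability about 1 − 1/e ≈ 63%), with 50k rows only ~0.002% of the negatives would coincide with their positives on average.

```python
import random

def avg_fixed_points(n, trials=200):
    """Average count of positions random.shuffle leaves unchanged."""
    identity = list(range(n))
    total = 0
    for _ in range(trials):
        shuffled = identity.copy()
        random.shuffle(shuffled)
        total += sum(o == s for o, s in zip(identity, shuffled))
    return total / trials

# The average hovers around 1 whatever n is, so the fraction of
# affected rows shrinks as 1/n: negligible at 50k examples.
print(avg_fixed_points(50_000, trials=50))
```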

MaartenGr · Apr 30 '25