Chapter 10, page 306
Should soft_negatives be defined as follows instead?
```python
import random

from datasets import Dataset
from tqdm import tqdm


def deranged_shuffle(original):
    """Shuffle until no element keeps its original position (i.e. a derangement)."""
    while True:
        shuffled = original.copy()
        random.shuffle(shuffled)
        if all(o != s for o, s in zip(original, shuffled)):
            return shuffled


# `mnli` is assumed to be loaded earlier in the chapter; keep only the entailment pairs (label == 0)
mnli = mnli.filter(lambda x: True if x["label"] == 0 else False)

# Prepare data and add a soft negative
train_dataset = {"anchor": [], "positive": [], "negative": []}

# soft_negatives = mnli["hypothesis"]
# random.shuffle(soft_negatives)
soft_negatives = deranged_shuffle(mnli["hypothesis"])

for row, soft_negative in tqdm(zip(mnli, soft_negatives)):
    train_dataset["anchor"].append(row["premise"])
    train_dataset["positive"].append(row["hypothesis"])
    train_dataset["negative"].append(soft_negative)

train_dataset = Dataset.from_dict(train_dataset)
```
Could you perhaps share what the difference is and why you believe it should be changed? Also, note that the code you shared is difficult to read. A tip is to wrap it in ``` brackets so that the code keeps its structure.
Sorry for the format.
I was thinking that with random.shuffle, there could be instances whose positions remain the same after shuffling, so for those rows the "negative" would actually be the same hypothesis as the "positive".
That could indeed theoretically happen. I'm wondering, given the amount of data, how often it would actually occur and what the impact on the results would be. It would be a nice experiment ;)
With 50k examples, I can imagine that happens quite seldom, and in that case using random.shuffle should suffice.
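For anyone curious, here is a minimal sketch of that experiment (the list size and trial count are illustrative choices, not values from the book). For a uniform random shuffle, the expected number of items that stay in place is about 1 regardless of list size, so on roughly 50k rows only a negligible fraction of negatives would coincide with their positive:

```python
import random

n_items = 50_000   # roughly the size of the MNLI subset discussed above (illustrative)
n_trials = 20      # arbitrary number of repetitions

total_fixed = 0
for _ in range(n_trials):
    indices = list(range(n_items))
    random.shuffle(indices)
    # A "fixed point" means the hypothesis stays aligned with its own premise,
    # so the soft negative would coincide with the positive for that row.
    total_fixed += sum(1 for original, shuffled in enumerate(indices) if original == shuffled)

print(f"Average fixed points per shuffle: {total_fixed / n_trials:.2f}")
print(f"Fraction of rows affected: {total_fixed / (n_trials * n_items):.4%}")
```

The deranged_shuffle version avoids even those rare cases via rejection sampling, and it is cheap as well: roughly 1/e ≈ 37% of random shuffles are already derangements, so it needs only about 2.7 shuffle attempts on average.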