[Hubert] Use different kmeans models for train and valid dataset?

Open amadeusuzx opened this issue 1 year ago • 0 comments

Hi there,

For HuBERT pretraining, the data preparation guide here indicates that a separate k-means model should be trained on each of the train and valid sets and used to produce the initial clusters and targets. However, I find that doing so causes a high and increasing valid loss. When I instead generate the valid targets with the k-means model trained on the train set, the valid loss goes down just like the train loss. My understanding is that k-means is a very random and data-dependent process, so the clusters can be very different even when the two models are trained on datasets from the same data distribution.
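To make the setup concrete, here is a toy sketch of the two ways of producing valid targets. It is only an illustration under assumptions: the 2-D points stand in for real acoustic features, and `fit_kmeans`, `assign`, `make_split`, and `id_map` are hypothetical helpers written for this post, not part of any HuBERT tooling. The k-means here is a simplified version with farthest-point initialization, not the implementation used in the actual pipeline.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]


def fit_kmeans(points, k, seed, n_iter=10):
    """Toy k-means: farthest-point init followed by Lloyd iterations."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids


def make_split(seed, n_per_cluster=50):
    """Synthetic split: three well-separated blobs (assumed toy data)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    points, true_ids = [], []
    for i, (cx, cy) in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append((cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5)))
            true_ids.append(i)
    return points, true_ids


def id_map(true_ids, targets):
    """Map each underlying blob to the set of cluster IDs assigned to it."""
    return {
        t: sorted({lab for ti, lab in zip(true_ids, targets) if ti == t})
        for t in set(true_ids)
    }


train_feats, train_true = make_split(seed=0)
valid_feats, valid_true = make_split(seed=1)

# Option A: one model fitted on train, reused to label both splits.
km_train = fit_kmeans(train_feats, k=3, seed=0)
train_targets = assign(train_feats, km_train)
valid_targets = assign(valid_feats, km_train)

# Option B: a separate model fitted on valid. Its cluster IDs are not
# guaranteed to line up with the IDs of the train model.
km_valid = fit_kmeans(valid_feats, k=3, seed=1)
valid_targets_separate = assign(valid_feats, km_valid)

print("train, shared model:  ", id_map(train_true, train_targets))
print("valid, shared model:  ", id_map(valid_true, valid_targets))
print("valid, separate model:", id_map(valid_true, valid_targets_separate))
```

With the shared model, the same underlying blob always receives the same cluster ID in both splits, so the valid targets mean the same thing as the train targets. With the separate model, the ID assigned to a blob depends on that model's own initialization and data, so it may or may not match, which is consistent with the valid loss behaviour I described.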

What is the best way to prepare the initial kmeans targets? Any advice would be appreciated.

Mar 05 '24 16:03 amadeusuzx