
Consider modifying `early_stop` threshold

Open doliv071 opened this issue 10 months ago • 5 comments

I've recently run into a situation where Harmony runs 2 iterations and then finishes, sometimes with decent results, other times with very poor ones. Setting `early_stop = FALSE` and `max_iter = 20` produces good results, but I was wondering whether a smaller threshold for early stopping (e.g. 1e-8) would provide better default behavior. Since Harmony remains extremely fast and resource friendly, I don't think it would cause any issues for users to have it try a little harder to find a good solution.

Best,
Dave

doliv071 avatar Apr 14 '25 18:04 doliv071

Hi @doliv071

Can you share a convergence plot with us? Sometimes setting `early_stop = FALSE` makes the objective diverge, which can leave the data overintegrated. Whether this happens is data-dependent, so the `early_stop` mechanism acts as a fail-safe to prevent further divergence.

I agree we can demonstrate the workflow you outlined, that is, setting `early_stop = FALSE` and then tweaking `max_iter` to make sure we have extracted the best embeddings from the algorithm. Happy to accept and incorporate a vignette where you show this.
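
As a rough illustration of the trade-off discussed above, here is a minimal Python sketch of an early-stop rule based on relative improvement of the objective. It is not Harmony's actual code: the function name `iterations_run`, the relative-improvement rule, the `1e-4` default for `epsilon`, and the objective values are all assumptions made up for the example.

```python
def iterations_run(objectives, epsilon=1e-4, early_stop=True, max_iter=10):
    """Return how many iterations run before the loop stops.

    objectives[i] is a (hypothetical) objective value after iteration i + 1;
    lower is better. Values are assumed to be non-zero.
    """
    limit = min(max_iter, len(objectives))
    if not early_stop:
        # Without early stopping, only the iteration cap applies.
        return limit
    for i in range(1, limit):
        old, new = objectives[i - 1], objectives[i]
        # Relative improvement; negative when the objective got worse.
        rel_improvement = (old - new) / abs(old)
        if rel_improvement < epsilon:
            return i + 1
    return limit


# Made-up trace: a tiny gain at iteration 2, larger gains afterwards.
trace = [100.0, 99.995, 97.0, 95.0, 94.9, 94.89, 94.889]

print(iterations_run(trace, epsilon=1e-4))                   # 2
print(iterations_run(trace, epsilon=1e-8))                   # 7
print(iterations_run(trace, early_stop=False, max_iter=20))  # 7
```

Under these assumptions, a single near-flat step is enough to stop the run at iteration 2, while a tighter threshold or disabling early stopping lets it continue. The same rule also stops as soon as the objective gets worse, which is the fail-safe behavior described above.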

pati-ni avatar Apr 21 '25 14:04 pati-ni

Hi @pati-ni,

I'd be happy to add a section to the vignette about this, although I think I will need to read the detailed walkthrough more thoroughly to see how best to make it fit.

I've attached the two convergence plots (one with early stop, and one without early stop and with 20 iterations).

With Early Stop

Without Early Stop

doliv071 avatar Apr 24 '25 14:04 doliv071

@doliv071 The plots you posted are interesting. I would not rule out overcorrection. Can you briefly describe the data, such as batch sizes and experimental design?

pati-ni avatar Apr 24 '25 15:04 pati-ni

I am interested in what overcorrection might look like, or how you might diagnose it without a priori knowledge of the system.

In this dataset I have two different tendon types with 4 samples each. In general, I have no reason to believe that the tendon types differ significantly in their cell populations, so my check for whether things are working is to visualize the corrected PCs by tissue type.

Uncorrected: Image

Early Stop: Image

20 Iterations: Image

doliv071 avatar Apr 24 '25 15:04 doliv071

  1. Could you share some UMAPs so I can gain a better understanding of the dataset's complexity? The first two PCs have limited capacity to capture the full range of cell types in the dataset.

  2. Given your prior statement that "no significant differences are expected", I have also been developing a newer version, which we plan to push upstream soon. You are welcome to beta test it if you wish, and we would be interested in seeing the outcome.

  3. Additionally, if your batches are imbalanced in terms of cell numbers, it may be worth setting lambda to NULL.

pati-ni avatar Apr 24 '25 20:04 pati-ni