Consider modifying `early_stop` threshold
I've recently run into a situation where Harmony runs 2 iterations and then finishes, sometimes with decent results and other times with very poor ones. Setting `early_stop = FALSE` and `max_iter = 20` produces good results, but I was wondering whether a smaller early-stop threshold (e.g. 1e-8) would provide better default behavior. Since Harmony remains extremely fast and resource friendly, I don't think it would cause any issues for users to have it try a little harder to find a good solution.
Best, Dave
Hi @doliv071
Can you share a convergence plot with us? Setting `early_stop=FALSE` sometimes makes the objective diverge, which can leave the data overintegrated. This is data-dependent, which is why the `early_stop` mechanism exists as a fail-safe to prevent further divergence.
I agree we can demonstrate the workflow you outlined, that is, setting `early_stop=FALSE` and then tweaking `max_iter` to make sure we have extracted the best embeddings from the algorithm. Happy to accept and incorporate a vignette where you show this.
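To make the trade-off concrete, here is a minimal sketch of how a relative-change stopping rule interacts with its threshold. It is not Harmony's actual implementation: the function names, the tolerance values, and the objective trace are all made up for illustration, and it is written in Python only to stay self-contained.

```python
def stopping_iteration(objective, tol):
    """Return the 1-based iteration at which a relative-change rule would stop.

    objective: objective values, one per completed iteration (lower is better).
    tol: stop once the relative improvement between consecutive iterations
         drops below this value. A rising objective gives a negative
         improvement, so it also triggers the stop.
    Returns len(objective) if the rule never fires.
    """
    for i in range(1, len(objective)):
        prev, curr = objective[i - 1], objective[i]
        rel_improvement = (prev - curr) / abs(prev)
        if rel_improvement < tol:
            return i + 1
    return len(objective)


def first_increase(objective):
    """Return the 1-based iteration where the objective first rises, or None."""
    for i in range(1, len(objective)):
        if objective[i] > objective[i - 1]:
            return i + 1
    return None


# Hypothetical objective trace, invented for illustration (not a real Harmony run).
trace = [1000.0, 990.0, 989.95, 989.0, 988.9, 988.89, 988.95]

loose = stopping_iteration(trace, tol=1e-4)   # looser, illustrative threshold
strict = stopping_iteration(trace, tol=1e-8)  # threshold suggested in this issue
diverged_at = first_increase(trace)

print(loose, strict, diverged_at)
```

With this invented trace, the looser threshold stops at a small plateau before the objective improves again, while the stricter one keeps going until the objective starts to rise. Which behavior is preferable on real data is exactly what the convergence plot should help decide.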
Hi @pati-ni,
I'd be happy to add a section to the vignette about this. Although, I think I will need to do some more thorough reading of the detailed walkthrough to see how best to make it fit.
I've attached the two convergence plots (one with early stop, and one without early stop and with 20 iterations).
@doliv071 The plots you posted are interesting. I would not rule out the scenario of overcorrection. Can you describe briefly the data such as batch sizes and experimental design?
I am interested in what overcorrection might look like and how you might diagnose it without a priori knowledge of the system.
In this dataset I have two different tendon types with 4 samples each. In general, I have no reason to believe that one tendon type differs significantly from the other in its cell populations, so my check for whether things are working is to visualize the corrected PCs by tissue type.
Uncorrected:
Early Stop:
20 Iterations:
- Could you share some UMAPs so I can gain a better understanding of the dataset's complexity? The first two PCs have limited capacity to capture the full range of cell types in the dataset.
- Given your prior statement that "no significant differences are expected", I have also been developing a newer version, which we plan to push upstream soon. You are welcome to beta test it if you wish, and we would be interested in seeing the outcome.
- Additionally, if your batches are imbalanced in terms of cell numbers, it may be worth setting `lambda` to `NULL`.