Tricky Results - Potential Bug
Hi,
I recently ran an LCA with measurement = binary. The results show 13 classes in total, but I found that 6 of them (classes 1, 2, 4, 5, 6, 9) are exactly the same according to model.get_mm_df(). I then ran model.predict(X) and found that class labels 1, 2, 4, 5, 9 were missing: no data (X) was assigned to these classes. So I manually merged them.
I also checked the crosstab, and the aforementioned classes were missing there as well. The total number of classes was identified by grid search, so I assume 13 produces a better metric value, but in fact there are only 8 distinct classes.
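Roughly what I ran (a sketch, not my exact code; `X` is my binary indicator data and `y` the variable I crosstabbed the predictions against):

```python
import numpy as np
import pandas as pd
from stepmix.stepmix import StepMix

# Fit an LCA with binary indicators and the 13 classes selected by grid search
model = StepMix(n_components=13, measurement="binary", random_state=42)
model.fit(X)

# Measurement model parameters per class: several rows turned out identical
print(model.get_mm_df())

# Predicted class labels: some classes never appear
labels = model.predict(X)
print(np.unique(labels))  # classes 1, 2, 4, 5, 9 are missing here

# Crosstab of predicted classes shows the same gap
print(pd.crosstab(labels, y))
```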
Does anyone know the reason?
Thanks for reporting this.
- Can you check the observations from classes 1,2,4,5,6,9? Specifically, are they identical or extremely similar?
- Have you tried fitting an estimator with fewer classes? I would consider setting n_components=8.
- Some classes never getting predicted can happen. The class prediction is an argmax over the probability of belonging to each class. You can check those probabilities directly with predict_proba (see the sketch below).
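Something along these lines (a rough sketch; `model` is your fitted StepMix estimator and `X` your data):

```python
# Posterior probability of each class for every observation
proba = model.predict_proba(X)   # shape (n_samples, n_components)

# predict() is simply the argmax over these probabilities
labels = proba.argmax(axis=1)

# The columns for the classes with identical parameters should be
# (near-)identical, so the argmax is decided by tiny numerical differences
duplicate_classes = [1, 2, 4, 5, 6, 9]
print(proba[:5, duplicate_classes])
```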
Hi,
- I could not check the observations from classes 1, 2, 4, 5, and 9, because no observations were classified with these labels. I checked the observations in class 6, and yes, they are identical.
- Yes, I ran a grid search over the number of classes, and it shows that 13 is the best. I also tried 8, but then the crosstab only shows 5 classes.
- Thanks for your answer, I will check. Much appreciated for the great work, I like StepMix.
Given that the 6 classes are identical in terms of parameters, you should see very similar probabilities in predict_proba for the observations that get assigned to class 6. I suspect class 6 gets predicted essentially because it is numerically slightly more likely than the others.
What seems to be happening here is that multiple classes latch on to the same data cluster.
I would consider testing different validation metrics, including AIC or BIC, to penalize unnecessarily complex models. You can also plot validation metrics for different numbers of components (we did something similar in this tutorial); see the sketch below. 13 components might get selected as the best fit, but you might observe an elbow at n_components < 13 and then a plateau with negligible improvements.
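A rough sketch of what I mean (I'm assuming here that aic/bic are available with the usual sklearn mixture signature; adapt this to whatever metric you used in your grid search):

```python
import matplotlib.pyplot as plt
from stepmix.stepmix import StepMix

# Compare models with an increasing number of latent classes.
# NOTE: bic() is assumed to follow the sklearn mixture convention
# (lower is better); swap in your own metric if needed.
n_range = range(1, 14)
bics = []
for k in n_range:
    model = StepMix(n_components=k, measurement="binary", random_state=42)
    model.fit(X)
    bics.append(model.bic(X))

plt.plot(list(n_range), bics, marker="o")
plt.xlabel("n_components")
plt.ylabel("BIC")
plt.show()
# Look for an elbow: the point after which extra classes barely improve the fit.
```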
@yuanjames are you still stuck with this? I will close, but feel free to reopen if needed.