enstop icon indicating copy to clipboard operation
enstop copied to clipboard

Call to plsa_refit fails due to missing sample_weight

Open ydennisy opened this issue 5 years ago • 13 comments

When using model.transform() on new unseen data, the following error occurs:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-30746b9aeea8> in <module>
      1 test_corpus = df1['cleaned_text'].tolist()
      2 test_dtm = vectorizer.transform(test_corpus)
----> 3 test_doc_vecs = model.transform(test_dtm)
      4 labels = np.argmax(test_doc_vecs, axis=1)
      5 

/opt/conda/lib/python3.7/site-packages/enstop/enstop_.py in transform(self, X, y)
    836             n_iter_per_test=5,
    837             tolerance=0.001,
--> 838             random_state=random_state,
    839         )
    840 

TypeError: plsa_refit() missing 1 required positional argument: 'sample_weight'

There seems to be a missing arg here.

Seems a simple fix - I would be happy to make a PR, but I am not sure how to derive the needed arg:

    sample_weight: array of shape (n_docs,)
        Input document weights.

If @lmcinnes you can shed some light here - could be a quick fix!

ydennisy avatar Sep 05 '20 17:09 ydennisy

I thought, in fact, that this is something that had been fixed. The latest version in the repository should, I think, handle this correctly. I just haven't made a release to PyPI recently. Can you verify that the current master works for you on this issue?

lmcinnes avatar Sep 05 '20 17:09 lmcinnes

@lmcinnes thanks for the quick reply! I will give it a go - but I linked to master, which does seem to be missing the args needed.

ydennisy avatar Sep 05 '20 17:09 ydennisy

@lmcinnes I can confirm this issue is still occurring even after installing using:

pip install -U git+https://github.com/lmcinnes/enstop.git@067a813b14a95cb14bd87e263cfe0305b49bf03f which is referencing the most recent commit on master.

ydennisy avatar Sep 05 '20 17:09 ydennisy

Hmm, that's less good then :-( There is, at least, theoretically code that guards around this now. I'll have a look soon.

lmcinnes avatar Sep 05 '20 17:09 lmcinnes

@lmcinnes let me know if you can give a little guidance - and I will be happy to try and help out here.

The topics I got on the first run were solid - so would love to continue using enstop!

ydennisy avatar Sep 06 '20 14:09 ydennisy

The short answer is that the sample_weights should effectively simply be all ones unless otherwise specified. In the newer code there is actually a bifurcation and a separate version of the m-step of the em-algorithm that uses sample weights and one that doesn't. Ideally if no sample weights are given we should be taking the latter path and not using sample weights at all.

lmcinnes avatar Sep 06 '20 16:09 lmcinnes

@lmcinnes just to let you know I had a fix running locally which would always set the values of sample_weights to 1s, however there are issues with the kernel being restarted / dying also in this library - for which I am not able to find a cause or the error log.

ydennisy avatar Sep 29 '20 15:09 ydennisy

Hmm, that's more disconcerting. I have been doing a lot of messing around with it a month ago and may have subtly broken something. I'll try to get some time to look at all of this eventually. I certainly appreciate the feedback.

lmcinnes avatar Sep 30 '20 16:09 lmcinnes

@lmcinnes appreciate your replies.

Can you give a rough estimate as to when you think you can take a look?

ydennisy avatar Oct 07 '20 09:10 ydennisy

Right now I have many other pressing things. I am unlikely to get to this until November at the earliest. As a workaround in the meantime you can force the sample weights to be ones. Ideally this shouldn't be a problem at all, so there is something astray somewhere, but setting a set of all one sample weights will brute force a workaround in the meantime while I try to figure out the underlying fix.

lmcinnes avatar Oct 10 '20 04:10 lmcinnes

I would be happy to do this part - the issue with the kernel dying I am not really sure how to fix.

I will play around and report - if I find anything.

ydennisy avatar Oct 23 '20 11:10 ydennisy

@lmcinnes Do you think this library is something you plan to maintain?

ydennisy avatar Apr 24 '21 15:04 ydennisy

It is in a bit of limbo right now as I have had to step away from this library to work on other topics.

lmcinnes avatar Apr 26 '21 18:04 lmcinnes