
training a classifier should overwrite the .lex

Open kordjamshidi opened this issue 8 years ago • 9 comments

It seems that if a classifier's .lex file was created earlier and still exists in the default path, retraining the classifier adds features to that same lexicon; that is, the lexicon is not overwritten.
(We need tests for load, for save, and for the case where classifiers are created from scratch. Related to #411.)

kordjamshidi, Aug 03 '17 20:08

@danyaljj, do you have any comments on this?

kordjamshidi, Aug 04 '17 02:08

Just to clarify: are you saying that training a model writes the lexicon file to disk before, or without, calling save()?

danyaljj, Aug 04 '17 17:08

No, whether save() is called is not the issue. The issue is that when a lex file already exists from an earlier run, train() simply uses it and adds new features to it, so the lexicon keeps growing as we run the app and call train() repeatedly (in different, independent runs).

kordjamshidi, Aug 04 '17 17:08
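To make the reported behavior concrete, here is a minimal toy sketch in Python. It is not Saul or LBJava code: the JSON file format and the names `load_lexicon`, `train_and_save`, and `lexicon_sizes_over_runs` are assumptions made up for illustration. It only mimics the pattern described above, where each independent run reuses whatever lexicon file it finds and appends to it.

```python
import json
import os
import tempfile


def load_lexicon(path):
    """Return the feature-to-index map stored at `path`, or an empty one."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def train_and_save(path, features):
    """Hypothetical training run that reuses any lexicon left on disk."""
    lexicon = load_lexicon(path)  # a stale file from an earlier run is picked up here
    for name in features:
        # Unseen features get the next free index; known ones keep theirs.
        lexicon.setdefault(name, len(lexicon))
    with open(path, "w") as f:
        json.dump(lexicon, f)
    return lexicon


def lexicon_sizes_over_runs(runs):
    """Simulate independent runs sharing one default path; return lexicon sizes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.lex")
        return [len(train_and_save(path, feats)) for feats in runs]
```

In this toy model, calling `lexicon_sizes_over_runs` with runs that each see partly different features returns a non-decreasing list of sizes, because nothing ever resets the file between runs.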

I see. So you think we should always remove the lexicon file at the beginning of training?

danyaljj, Aug 04 '17 17:08

I expected it to be overwritten by default. We need a way to indicate whether we want to continue training or train from scratch, because always removing those files at the beginning of training would be a problem when we want to initialize models from an existing lex and lc.

kordjamshidi, Aug 04 '17 17:08

Right, I agree it's tricky. We could ask the user at the beginning of training:

Do you want to remove existing model files? [Y/N]

What do you think?

danyaljj, Aug 04 '17 17:08

Sounds good to me. @Rahgooy might have comments.

kordjamshidi, Aug 04 '17 17:08

I think that works for training a single model, but when we want to train multiple models, say in a loop, the user has to wait for each model to finish training before answering the next [Y/N] prompt. IMO, the better option is to expose it as a parameter or something similar.

Rahgooy, Aug 04 '17 18:08
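A rough sketch of the parameter-based option, again as a toy in Python rather than Saul code. The `from_scratch` flag, the JSON lexicon file, and the functions `train` and `train_all` are hypothetical names chosen for illustration. The default overwrites any existing lexicon, and passing `from_scratch=False` keeps it, so a loop over several models never blocks on a prompt.

```python
import json
import os
import tempfile


def train(lex_path, features, from_scratch=True):
    """Hypothetical train(): ignore an existing lexicon unless told to continue."""
    lexicon = {}
    if not from_scratch and os.path.exists(lex_path):
        # Continue training: start from the lexicon saved by an earlier run.
        with open(lex_path) as f:
            lexicon = json.load(f)
    for name in features:
        lexicon.setdefault(name, len(lexicon))
    with open(lex_path, "w") as f:
        json.dump(lexicon, f)  # with from_scratch=True this overwrites the old file
    return lexicon


def train_all(model_dir, feature_sets, from_scratch=True):
    """Train several toy models in a loop with one non-interactive setting."""
    sizes = {}
    for model_name, features in feature_sets.items():
        path = os.path.join(model_dir, model_name + ".lex")
        sizes[model_name] = len(train(path, features, from_scratch))
    return sizes
```

With this shape the choice is made once by the caller, for example `train_all(tempfile.mkdtemp(), {"m1": ["a", "b"], "m2": ["x"]})`, instead of once per model at a prompt.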

In fact, for joint training we already have the init parameter: here

kordjamshidi, Aug 04 '17 18:08