Optional language filter for BASE
Based on feedback received via e-mail, it would be useful to add an optional language filter for BASE. We had initially decided against a language filter, as BASE does not have consistent language information. But there are ambiguous terms where such an optional filter would be useful, e.g. when searching for "maine".
Your "not have consistent language information" means: there are many ways of implementing this :-) Can you give me a pointer to where I can find the data the algorithms need to work with?
We do not have any language detection on our side, because we need to know the language at query time. But this is the script that queries BASE and preprocesses the metadata. You can find a suitable test here.
What do you mean by "BASE does not have consistent language information"? As far as I can remember, you never raised this issue with BASE, @pkraker. Please send me an e-mail detailing the problem.
@pietsch There are around 37 million documents in BASE where the language is reported as unknown (tested with rbace and bs_search(query="dclang:unknown")). We had not raised this issue with you as we were still evaluating the need for a language filter. This very thread is intended to gather further feedback on the needs and requirements for a language filter.
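
To make the discussion more concrete, here is a minimal Python sketch of how an optional filter could treat records whose language is unknown. It is only an illustration, not the actual preprocessing script: the function name `filter_by_language`, the `keep_unknown` switch, the record layout, and the sample records are all hypothetical. The field name `dclang` and the value `unknown` are taken from the query above, and the three-letter language codes are an assumption.

```python
from typing import Iterable, List, Optional

UNKNOWN = "unknown"  # value used in the dclang:unknown query above


def filter_by_language(
    records: Iterable[dict],
    language: Optional[str] = None,
    keep_unknown: bool = False,
) -> List[dict]:
    """Return the records matching `language` (hypothetical helper).

    With language=None the filter is off and every record is returned.
    With keep_unknown=True, records without usable language information
    are kept as well, so they do not silently disappear from the results.
    """
    if language is None:
        return list(records)

    wanted = language.lower()
    kept = []
    for record in records:
        # A missing or empty language field is treated like "unknown".
        lang = (record.get("dclang") or UNKNOWN).lower()
        if lang == wanted or (keep_unknown and lang == UNKNOWN):
            kept.append(record)
    return kept


# Invented sample records, for illustration only.
sample = [
    {"title": "Maine coastal survey", "dclang": "eng"},
    {"title": "La Maine au Moyen Age", "dclang": "fre"},
    {"title": "Maine", "dclang": "unknown"},
    {"title": "Untitled", "dclang": None},
]

english_only = filter_by_language(sample, "eng")
english_or_unknown = filter_by_language(sample, "eng", keep_unknown=True)
```

The `keep_unknown` switch reflects the open question in this thread: with so many documents reported as unknown, a strict filter would hide them, while a lenient one would keep them at the cost of some noise.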