Optional language filter for BASE

Open pkraker opened this issue 8 years ago • 4 comments

Based on feedback received via e-mail, it would be useful to add an optional language filter for BASE. We had initially decided against a language filter, as BASE does not have consistent language information. But there are ambiguous terms where such an optional filter would be useful, e.g. when searching for "maine".

pkraker avatar Sep 18 '17 13:09 pkraker

Your "not have consistent language information" means: there are many ways of implementing this :-) Can you point me to the data that the algorithms would need to work with?

steltenpower avatar Sep 27 '17 13:09 steltenpower

We do not have any language detection on our side, because we need to know the language at query time. But this is the script that queries BASE and preprocesses the metadata. You can find a suitable test here.

pkraker avatar Oct 03 '17 17:10 pkraker

What do you mean by "BASE does not have consistent language information"? As far as I can remember, you never raised this issue with BASE, @pkraker. Please send me an e-mail detailing the problem.

pietsch avatar Oct 03 '17 19:10 pietsch

@pietsch There are around 37 million documents in BASE where the language is reported as unknown (tested with rbace and bs_search(query="dclang:unknown")). We had not raised this issue with you as we were still evaluating the need for a language filter. This very thread is intended to gather further feedback on the needs and requirements for a language filter.

pkraker avatar Oct 04 '17 09:10 pkraker
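
To make the discussion concrete, here is a minimal sketch of how an optional language filter could be added at query-building time. It is not taken from the Headstart code base: the function name build_base_query and its parameters are hypothetical. The dclang field comes from the query quoted above; that it accepts language codes such as "eng", and that clauses can be combined with OR, are assumptions that would need to be checked against the BASE interface documentation.

```python
from typing import Optional


def build_base_query(terms: str,
                     lang: Optional[str] = None,
                     include_unknown: bool = False) -> str:
    """Return a query string, optionally restricted to one language.

    Hypothetical helper: `lang` is assumed to be a code that the
    `dclang` field understands (e.g. "eng"); this is not verified here.
    """
    terms = terms.strip()
    if not terms:
        raise ValueError("terms must not be empty")

    # No language given: the filter is optional, so the query is unchanged.
    if lang is None:
        return terms

    # Optionally keep documents whose language is reported as unknown,
    # since many records carry dclang:unknown.
    # The OR syntax is an assumption about the query language.
    if include_unknown:
        return f"{terms} (dclang:{lang} OR dclang:unknown)"

    return f"{terms} dclang:{lang}"


if __name__ == "__main__":
    print(build_base_query("maine"))
    print(build_base_query("maine", lang="eng"))
    print(build_base_query("maine", lang="eng", include_unknown=True))
```

The include_unknown switch reflects the trade-off raised in this thread: a strict filter would silently drop every document without language information, while including them reduces the precision of the filter.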