Make category search non case-sensitive and more user friendly
Received a report from a 2.11 user on our FB page that category search is case-sensitive for her, which means that sometimes she'll type the right category in the search field but nothing will show up.
AFAIK the MW API that we use is inherently case-sensitive, but the upload wizard seems to be able to find a way around that and produces the same category suggestions regardless of case.
Edit: Apart from the case sensitivity, the allcategories API also has a problem of doing a prefix match. This does not give a great UX. We should explore ways to fix this too.
Maybe if we convert everything to lowercase then the server performs a non-case-sensitive search? That's just a hypothesis, I have not tried it.
@nicolas-raoul Possible! We'll try it out with a direct query first.
Can I take this issue?
@ankit-kumar-dwivedi please feel free!
@misaochan Is this issue free to be worked upon? if so can i take it?
Hey! Yes sure you should start working on it as I'm not working on it right now.
Thank you:)
I'm re-opening this as I believe there's a problem with how this issue was fixed.
@kbhardwaj123 Can you clarify a doubt that I have regarding your PR #3326? In the description you say:
Tested from the MW api fuzzy search url that the category suggestions would deliver the desired results no matter what case you sent in the api call so to fix the issue api call has been converted to lower case.
Are you sure the API really doesn't care about the case of the category name given to it? I'm doubtful about that for several reasons. Here are a couple:
- The logical one: If the API really doesn't care about the case of the search text sent to it, this issue shouldn't exist to begin with. Right? IOW, if the API is returning us all categories that match a search text despite the case in which we send the query, then there's no point in just lower-casing the search text we send to the API. Got my point? But the mere existence of this issue indicates otherwise. Correct me if I'm missing something.
- The practical one: I just checked with a couple of API calls and I get different results depending on the case of the search text I send in the query. Here are a couple of queries which return different results even though only the case of the search text differs:
  - https://commons.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=allcategories&gacprefix=covid&gaclimit=25&gacoffset=0
  - https://commons.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=allcategories&gacprefix=COVID&gaclimit=25&gacoffset=0

FYI: AFAIK, we use the same API to query the categories in the app.
In case you're wondering why the test case didn't fail. Here's the catch:
The page title is case-sensitive except the first character.
From Manual:Page title - MediaWiki
I think the quote speaks for itself. I'll share the actual problem w.r.t. the app in the next comment.
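To make the quoted rule concrete, here is a minimal sketch of how a case-sensitive prefix match behaves when only the first character is normalized. It is a local simulation, not the server's actual implementation; the helper names `normalize_first_letter` and `prefix_matches` are made up for illustration, and the two titles come from the query results discussed in this thread.

```python
def normalize_first_letter(prefix: str) -> str:
    """Upper-case only the first character; the rest is left untouched."""
    return prefix[:1].upper() + prefix[1:]


def prefix_matches(prefix: str, title: str) -> bool:
    """Case-sensitive prefix match after first-letter normalization (assumed model)."""
    return title.startswith(normalize_first_letter(prefix))


# Two category titles that appear in this thread.
titles = ["COVID-19", "COVID-19 guidelines"]

for query in ("COVID", "covid"):
    hits = [t for t in titles if prefix_matches(query, t)]
    print(query, "->", hits)
```

Under this model, lower-casing the query turns "COVID" into "covid", which normalizes to "Covid" and no longer matches any title starting with "COVID". A test that only uses categories with a capital first letter would not notice the difference.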
@sivaraam While I was working on this I went with what @nicolas-raoul suggested, so I ensured that all the category strings being passed to the OkHttpClient are converted to lower case. I wrote new tests for that and they worked fine, but I guess I must have missed something. I will take a look at it again.
Ok. Here's the issue with respect to the app: category search doesn't return any categories with a prefix that has an upper case character in it (other than the first one, of course). See #3582 for proof.
In case the issue is not clear to you from #3582, here's another example.
Here's what I get when I search for categories with "COVID" (mind the case) in the app (version: 2.12.3.629~a63a358):

Now, consider the linked example query which returns 25 categories which have "COVID" as their prefix. Here are the categories that the query returns:
Category:COVID-19 guidelines in Brazil
Category:COVID-19 guidelines in Argentina
Category:COVID-19 guidelines in Albania
Category:COVID-19
Category:COVID-19 guidelines by country
Category:COVID-19 guidelines in Czechia
Category:COVID-19 guidelines
Category:COVID-19 guidelines in Denmark
Category:COVID-19 guidelines in Esperanto
Category:COVID-19 Clinical Cohort Research Conference, March 18, 2019, National Medical Center, Republic of Korea
Category:COVID-19 coronavirus
Category:COVID-19 guidelines by language
Category:COVID-19 guidelines in Arabic
Category:COVID-19 guidelines in English
Category:COVID-19 guidelines in Basque
Category:COVID-19 guidelines in China
Category:COVID-19 guidelines in Estonian
Category:COVID-19 guidelines in Bengali
Category:COVID-19 guidelines in Bangladesh
Category:COVID-19 guideline cartoons by Anika Nawar Eeha and Abdullah Al Mamun in Bengali
Category:COVID-19 guidelines in Bengali by Anika Nawar Eeha and Abdullah Al Mamun
Category:COVID-19 guidelines in Catalan
Category:COVID-19 guidelines in East Timor
Category:COVID-19 DIY
Category:COVID-19 guidelines in Breton
As you can see, none of the above categories are shown in the category suggestions.
@misaochan Given that we've now accidentally reduced the category search space rather than increasing it, you might want to ensure we fix this before releasing the next version.
Added to the release list, thanks for the heads up!
Hi @kbhardwaj123 , are you currently still working on this? Please do keep us updated, thanks!
@misaochan sure I'm on it, will update ASAP
Thanks @kbhardwaj123 ! As we are planning to include this in v2.13, when you submit your PR could you please rebase and submit it on the 2.13-release branch?
I investigated the problem and here are my findings.
- Thanks @sivaraam for such in-depth details. You are right, the API is indeed case-sensitive.
- We have 2 search options at our disposal: we can search by prefix, or we can use the normal search (which matches in between and is case-insensitive).
So suppose I want to find the category Temple of Ishtar at Mari by entering temple of ishtar. These are the results:
- using generator=allcategories in the URL: API result, which apparently is a prefix search and is case-sensitive
- using generator=search in the URL: API result, which is case-insensitive and gives us the required Temple of Ishtar at Mari category.
Now, on reading the logs I realized that the method searchAll() in CategoriesModel was calling only the prefix search, and that right there is where the problem is. So when I fix that by calling both the prefix and search APIs and combining the results, we finally get a case-insensitive search.
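The "call both and combine" idea can be sketched as follows. This is only an illustration of the merging step, not the app's actual searchAll() code; `merge_results` is a hypothetical helper, and the two input lists are placeholder values standing in for whatever the prefix and search calls return.

```python
from typing import Iterable, List


def merge_results(prefix_hits: Iterable[str], search_hits: Iterable[str]) -> List[str]:
    """Combine two result lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for title in list(prefix_hits) + list(search_hits):
        if title not in seen:
            seen.add(title)
            merged.append(title)
    return merged


# Placeholder inputs: the prefix search finds nothing for the lower-case
# query, while the full-text search returns the wanted category.
prefix_hits = []
search_hits = ["Category:Temple of Ishtar at Mari"]

print(merge_results(prefix_hits, search_hits))
```

Putting prefix hits first keeps exact prefix matches at the top of the suggestion list; whether that ordering is the right one for the app is a design choice left open here.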
But there's a catch: we are using the beta flavor of the APIs, which gives the following results:
- prefix result: API link, which doesn't give the required category (this was expected)
- search result: API result. Here we expected Temple of Ishtar at Mari, right? But it seems like the beta flavor of the API is unable to give the required result.
Possible solution: AFAIK there are two ways:
- (efficient) Use the production version of the Commons APIs, which are capable of delivering case-insensitive results by themselves
- Or make multiple API calls to the existing API by manipulating the search string
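For the second option, "manipulating the search string" could mean generating a few case variants of the query and sending one prefix request per variant. The sketch below shows only the variant generation; `case_variants` is a hypothetical helper and the choice of variants is an assumption, not something agreed in this thread.

```python
from typing import List


def case_variants(query: str) -> List[str]:
    """Return distinct case variants of the query, in a fixed order."""
    candidates = [
        query,               # as typed
        query.lower(),       # all lower case
        query.upper(),       # all upper case
        query.title(),       # Every Word Capitalized
        query.capitalize(),  # Only first letter capitalized
    ]
    variants = []
    for candidate in candidates:
        if candidate not in variants:
            variants.append(candidate)
    return variants


print(case_variants("covid"))
print(case_variants("temple of ishtar"))
```

Note the limitation: none of the variants of "temple of ishtar" equals "Temple of Ishtar", so mixed-case titles can still be missed. This approach also multiplies the number of requests per keystroke.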
@misaochan @sivaraam @nicolas-raoul @maskaravivek I need your opinions on my investigation in order to fix this for v2.13. I mean, are we going to use the production flavor of the APIs in v2.13?
@kbhardwaj123 Thanks for the analysis. I'll look into it and share my comments soon. I have a quick doubt about one particular thing:
We are using the beta flavor of the APIs which give the following results
What do you mean by beta flavor of API? Do you mean the API hosted in the beta server (https://commons.wikimedia.beta.wmflabs.org/w/api.php) as opposed to the production server (https://commons.wikimedia.org/w/api.php)?
For category search (and really any testing that does not involve actually uploading), please use the prodDebug flavor of the app. The beta server is unusable for most testing.
@sivaraam yes that's exactly what I meant: the API hosted on the beta server https://commons.wikimedia.beta.wmflabs.org/w/api.php has server API but it is unable to give the required result, whereas the production API https://commons.wikimedia.org/w/api.php gives the expected result, as shown by the links in my previous comment. @nicolas-raoul does the prodDebug flavor use the https://commons.wikimedia.org/w/api.php APIs?
@kbhardwaj123 Yes, prod* flavors use the production APIs, for instance https://commons.wikimedia.org/w/api.php . Sorry that our beta servers are not representative of production :'-(
@nicolas-raoul sure then the problem is solved already, I will create the pull request
@sivaraam yes that's exactly what i meant
Thanks for the clarification.
... the API hosted on the beta server commons.wikimedia.beta.wmflabs.org/w/api.php has server API but it is unable to give the required result, whereas the production API commons.wikimedia.org/w/api.php gives the expected result, as shown by the links in my previous comment.
Let me now clarify something. The API hosted in the beta servers and the production servers would not differ in a functional manner. I'm glossing over a little but I believe it's fine for the case in question. You see different results when you use the beta server only because not all the categories that are present in the prod server are present in the beta server. So, forget the notion that the APIs in the production servers are case-sensitive and the APIs in the beta servers are not, because both of them behave the same way. You can read more about the beta cluster in the following wiki page: Beta Cluster - MediaWiki
sure then the problem is solved already, I will create the pull request
Can you explain how you're going to fix this? I'm asking this to ensure everyone's on the same page. Also, I would suggest that you not rush this. I say this because making the category search case-insensitive seems to be a lot more complicated than it looks. It's better to know our options and choose the most appropriate one. If we have the release coming up soon, we can always just revert the changes done in PR #3326 (which we would have to do anyway) and move on with the release. We can then make the change after that. @misaochan can comment better about the deadline.
You see different results when you use the beta server only because not all the categories that are present in the prod server are present in the beta server. So, forget the notion that the APIs in the production servers are case-sensitive and the APIs in the beta servers are not, because both of them behave the same way.
Ok. Here's a proof for the fact that the Beta server behaves just the same way as the production server.
https://commons.wikimedia.beta.wmflabs.org/w/api.php?action=query&format=json&formatversion=2&generator=search&gsrnamespace=14&gsrsearch=testcat&gsrlimit=25&gsroffset=0
This returns the Category:TestCat despite the search term being testcat. So, the beta server's generator=search is case-insensitive too.
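For anyone who wants to reproduce such queries against either server, here is a small sketch that builds the generator=search URL from its parts. The parameter names and values are taken from the query above; the function name `build_search_url` is made up for illustration.

```python
from urllib.parse import urlencode

BETA_API = "https://commons.wikimedia.beta.wmflabs.org/w/api.php"
PROD_API = "https://commons.wikimedia.org/w/api.php"


def build_search_url(api: str, text: str, limit: int = 25, offset: int = 0) -> str:
    """Build a generator=search query restricted to the Category namespace (14)."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "generator": "search",
        "gsrnamespace": 14,
        "gsrsearch": text,
        "gsrlimit": limit,
        "gsroffset": offset,
    }
    return api + "?" + urlencode(params)


print(build_search_url(BETA_API, "testcat"))
print(build_search_url(PROD_API, "testcat"))
```

Swapping only the base URL makes it easy to compare the beta and production responses for the same search text.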
Let me now clarify something. The API hosted in the beta servers and the production servers would not differ in a functional manner. I'm glossing over a little but I believe it's fine for the case in question. You see different results when you use the beta server only because not all the categories that are present in the prod server are present in the beta server. So, forget the notion that the APIs in the production servers are case-sensitive and the APIs in the beta servers are not, because both of them behave the same way. You can read more about the beta cluster in the following wiki page: Beta Cluster - MediaWiki
@sivaraam Initially what I meant was that, since the beta servers don't have all the categories (that the working of both APIs is the same was clear from their documentation), this is what I wanted to show:
Suppose I want the category Temple of Ishtar at Mari by typing only temple of ishtar.
If the prodDebug APIs are used, they give what one expects:
- with generator=allcategories (result), it doesn't give the required category because it does a prefix-only search
- with generator=search (result), it gives the required category due to its case insensitivity
But the beta servers have a problem: they contain the category Temple of Ishtar at Mari, as shown by generator=allcategories (see here), but generator=search is incapable of returning the category when provided with temple of ishtar (see here).
In short, the beta server's case-sensitive API (generator=allcategories) delivers a category which the case-insensitive API is not able to return, and no such problem is there in the prodDebug APIs.
How I intend to solve this: the searchAll method is the one at fault here. It only calls the prefixSearch API for searching categories, so if we also make a call using generator=search and combine both the prefix and normal search results, our problem would be solved.
And yes we need to use the prodDebug APIs because of the point i just mentioned above.
@sivaraam Yes, I agree that the beta ones are case-insensitive, but they don't seem to return Category:Temple of Ishtar at Mari (using generator=search), while the case-sensitive beta API (generator=allcategories) returns the category, see result.
So I implemented the solution with the beta server's APIs, and this is how it looks, with screenshots:
using category suggested by @sivaraam Category:TestCat

Now with Category:Temple of Ishtar at Mari (here i am showing that it exists on beta server):

But from the following screenshot it is visible that generator=search doesn't return that which leaves this category as case sensitive

And as soon as I change the flavor of the APIs to prodDebug, all these problems disappear.
@kbhardwaj123 Thanks for your explanations. I see your problem now.
But from the following screenshot it is visible that generator=search doesn't return that which leaves this category as case sensitive
It's prudent to explore more before coming to conclusions. AFAIK, you can't just make some categories case sensitive and others case insensitive. It doesn't even make any sense, does it? Anyways, I'll try to clarify what's going on here. Here's the description of the search API from API:Search - MediaWiki [emphasis mine]:
GET request to search for a title or text in a wiki.
Just assume for now that the search API does not search the titles; I'll come back to why I make such an assumption later. Note that the search API looks for the text in the wiki pages. So, any query you send to generator=search looks for the search text in the contents of the wiki page (the category pages are the wiki pages, in our case). So, the results you get from the beta and production servers depend not just on the presence of the categories but also on what content is present in the category pages. Let's take your case of the "Category:Temple of Ishtar at Mari".
- Category:Temple of Ishtar at Mari - Wikimedia Commons: This is the category page in the production server. As you can see, the category page has content such as info boxes which contain your search text "temple of Ishtar". This could have been the reason for that category being included in the results of your API query to the production server: query link.
  Also, the fact that the search API searches the content is very clear from the results of the above query, which include categories such as Category:Astarte (goddess) and Category:Passing lion Babylon (Louvre, AO21118).
- Category:Temple of Ishtar at Mari - Wikimedia Commons BETA: This is the category page in the beta server. Don't get tricked on seeing the 'Media in category "Temple of Ishtar at Mari"' and presume it's text present in that category page. It's a standard section shown for all category pages showcasing the pictures that belong to that category. The category page itself doesn't exist yet. So, no wonder the search query you sent to the beta cluster didn't return this category as a result: query link.
I'm not very sure about how/why a page is included in the result as the algorithm seems to be more involved. Relevant quote from the "Additional notes" section in the API:Search page:
Depending on which search backend is in use, how srsearch is interpreted may vary. On Wikimedia wikis which use CirrusSearch, see Help:CirrusSearch for information about the search syntax.
Coming to the why assume search API doesn't search the title part. Try the following query:
https://commons.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=search&gsrnamespace=14&gsrsearch=temple%20of%20ishtar&gsrlimit=25&gsroffset=0&gsrwhat=title
I've just added gsrwhat=title to the query, which tells it to search just the title. As you can see, it clearly says: "title" search is disabled. Thus my assumption. See also: https://stackoverflow.com/q/14337219/5614968
I hope I've clarified your confusion about beta server not returning the results you expect, now. Let me know if I have not.
To conclude, the search API does more than what's needed (a category title search) and particularly doesn't seem to be searching the title at all. I don't think that would be a good choice. So, as I mentioned earlier we'll have to explore the proper way to achieve a case insensitive search. Here are a couple of related API pages:
Also, I believe we could ask the wikitech-l mailing list about this.
@sivaraam Thanks for such a comprehensive explanation :).
I agree with you that the search generator could be overkill; as you pointed out, searching temple of ishtar returns some completely unrelated categories because they contain that term in their wikitext body. So what I am thinking is that the workaround one person gave in the Stack Overflow question you mentioned, using intitle as in:
srsearch=intitle:temple%20of%20ishtar
could solve our issue and return only those categories with the required search term.
Kindly give your opinions on this
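The intitle: workaround mentioned above could be tried with a query like the one built below. This is a sketch only: `build_intitle_url` is a hypothetical helper, the other parameters are copied from the generator=search queries earlier in this thread (hence gsrsearch rather than srsearch), and whether the results are good enough for category suggestions has not been verified here.

```python
from urllib.parse import quote, urlencode

PROD_API = "https://commons.wikimedia.org/w/api.php"


def build_intitle_url(text: str, limit: int = 25, offset: int = 0) -> str:
    """Build a generator=search query that prefixes the text with intitle:."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "generator": "search",
        "gsrnamespace": 14,  # Category namespace
        "gsrsearch": "intitle:" + text,
        "gsrlimit": limit,
        "gsroffset": offset,
    }
    # quote_via=quote encodes spaces as %20; safe=":" keeps the colon readable.
    return PROD_API + "?" + urlencode(params, safe=":", quote_via=quote)


print(build_intitle_url("temple of ishtar"))
```

Opening the printed URL in a browser is a quick way to check which categories come back before wiring anything into the app.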