Incorrect handling of Unicode queries

Open emorozov opened this issue 15 years ago • 1 comments

When searching like this: Body.search.query(u'привет')

There are always zero results, while the same search from the command line returns hundreds. This is due to double (or even triple) UTF-8 encoding done somewhere in the guts of django-sphinx/sphinxapi.

There are instances of pointless code like unicode(string).encode('utf-8'). The problem is that if string is already a unicode object, this code will create a unicode object containing its UTF-8 representation and encode it as UTF-8 again, thus creating garbage. I've fixed this place in the code, but the string is still double-encoded somewhere. :(

This code is pointless anyway: even if it worked, it would be a no-op (take a bytestring, convert it to unicode, convert it back to a bytestring). But instead of a useless no-op, it makes garbage of unicode input.
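
For anyone unsure what double encoding looks like on the wire, here is a minimal sketch. It is written for Python 3, where `str` plays the role of Python 2's `unicode`. The helper names `encode_once` and `encode_twice` are made up for illustration, and the Latin-1 step is only one assumed way the bytes can get misread; it is not necessarily what happens inside django-sphinx or sphinxapi.

```python
# Python 3 sketch: `str` here corresponds to Python 2's `unicode`.
# Both helper names are hypothetical, not part of django-sphinx.

def encode_once(text):
    """Encode text to UTF-8 exactly once (what the search daemon should get)."""
    return text.encode("utf-8")


def encode_twice(text):
    """Simulate double encoding.

    The UTF-8 bytes are misread as Latin-1 text (one character per byte,
    an assumption chosen for illustration) and then encoded to UTF-8 again.
    """
    utf8_bytes = text.encode("utf-8")
    misread_as_text = utf8_bytes.decode("latin-1")
    return misread_as_text.encode("utf-8")


query = "привет"
good = encode_once(query)
bad = encode_twice(query)

print(len(good))    # 12: six Cyrillic letters, two bytes each
print(len(bad))     # 24: every original byte became a two-byte sequence
print(good == bad)  # False: the daemon would see a different byte string
```

A query mangled like this no longer matches the indexed words byte for byte, which would be consistent with getting zero results.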

Sep 11 '10 13:09 emorozov

This patch somewhat mitigates the problem by allowing searches with UTF-8 encoded strings:

--- models.py.orig  2010-09-11 17:14:01.000000000 +0400
+++ models.py   2010-09-11 17:32:18.000000000 +0400
@@ -289,7 +289,9 @@
         return self._clone(**kwargs)
 
     def query(self, string):
-        return self._clone(_query=unicode(string).encode('utf-8'))
+        if isinstance(string, unicode):
+            string = string.encode('utf-8')
+        return self._clone(_query=string)
 
     def group_by(self, attribute, func, groupsort='@group desc'):
         return self._clone(_groupby=attribute, _groupfunc=func, _groupsort=groupsort)
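
The idea behind the patch can be sketched as a standalone helper: encode only when the input is text, and pass byte strings through untouched. This version is written for Python 3 (`str` and `bytes` stand in for Python 2's `unicode` and `str`). The name `to_utf8_bytes` is hypothetical and not part of django-sphinx, and the sketch assumes any byte string passed in is already valid UTF-8.

```python
def to_utf8_bytes(query):
    """Return the query as UTF-8 bytes, encoding at most once.

    Hypothetical helper mirroring the patched query() logic above.
    """
    if isinstance(query, str):  # Python 2 equivalent: isinstance(query, unicode)
        return query.encode("utf-8")
    # Anything else (normally bytes) is passed through unchanged and is
    # assumed to be UTF-8 already; no validation is done here.
    return query


text_query = "привет"
bytes_query = text_query.encode("utf-8")

# Text and pre-encoded input end up as the same bytes.
print(to_utf8_bytes(text_query) == to_utf8_bytes(bytes_query))

# Applying the helper twice changes nothing, so it cannot double-encode.
print(to_utf8_bytes(to_utf8_bytes(text_query)) == bytes_query)
```

Because the helper leaves bytes alone, calling it at more than one layer is harmless, which is the property an unconditional re-encode lacks.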

Sep 11 '10 13:09 emorozov