Search indexes are being truncated
Description
It looks like search indexes are being truncated. During testing we noticed some keywords weren't matching results. After a reindex and digging in we saw that the searchindex table seems to max out at a consistent ~2328 characters, which excludes some content on sizable entries. The entries we noticed this on specifically are using rich text blocks inside a Matrix grid in case that is relevant.
SELECT length(keywords) AS characters
FROM craft_searchindex
ORDER BY characters DESC
Steps to reproduce
- Create lengthy content
- Search for keywords deep in the text
Additional info
- Craft version: Craft Pro 3.0.15
- PHP version: 7.2.2
- Database driver & version: PostgreSQL 9.6.3
- Plugins & versions: N/A
Yeah Craft does that intentionally to avoid index row size exceeds maximum for index errors. If search is a critical feature of your site, you might want to look into alternative search options that don’t rely on the database, like Algolia.
I'm happy to dig into this a little more but it appears the vector is being processed after the keywords are truncated? Since the former will be much shorter can that be tweaked to allow more vector keywords? Looks like PostgreSQL 9.6 added some new phrase matching options to vectors as well that might be helpful if/when 9.5 is no longer supported, that could be helpful for exact phrase searching.
Is this issue closed or open? Will search indexing improve within the core?
Further, I can except there is a limit, but is there any way to tell what the limit is in exact terms? For example over a certain number of words in an entry or something like that? When can we expect or look for as far as limits? What is "lengthy content" in the OP mean?
For MySQL, the keywords are truncated to 62,259 bytes (ceil of 65,535 ⨉ 0.95), to ensure they fit within a TEXT column. For Postgres, they are truncated to 2,328 bytes (ceil of 2450 ⨉ 0.95), to ensure they can be indexed by the B-TREE index. (And keywords are stored per-field.)
It’s on our list to look into whether this can be improved, which is why we’ve left the ticket open.
Hi team,
What are the current limits for Postgres?
I have lengthy articles that are not being indexed correctly.
@MarvinZ See my previous comment.