Searchdata file is referencing "page.content" instead of "post.content"

Open razvii22 opened this issue 11 months ago • 3 comments

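Describe the bug: Commit 4d8c224c5999861cb41c41b5d963907c48459cc9 replaced post.content with page.content in the searchdata file. Inside the post loop, page refers to the page being rendered (the searchdata file itself), so the "content" field of every post ends up holding the searchdata template instead of the post body. As a result, posts show up in the results when you search for keywords from the searchdata template file.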

To reproduce:

  1. Use the built-in JavaScript search engine.
  2. Add any posts to the _posts directory.
  3. Check the generated searchdata.js file in _site (if building locally), OR
  4. On any page, search for symbols or keywords that appear in the searchdata.js file. The example below is taken directly from https://www.drassil.org/git-wiki/main_page

[Screenshot not reproduced]

Expected behavior: Search should match the actual content of posts, not the code of the template file.

Screenshots

[Screenshots not reproduced]

I'm not aware of any issues that stem from using "post", so this looks like a mistake made while implementing the jsonify change.

razvii22 (May 07 '25 10:05)

I'd also like to mention that, even when search works as expected (as in the later screenshots), HTML is included in the searchdata content, because the referenced commit apparently removed the strip_html Liquid filter.

Tbh I wasn't aware of that filter until seeing this issue, but leaving the HTML in the search index leads to unnecessary bloat. I dealt with this for our wiki by building the search index with a shell script instead of Liquid: it reads the raw Markdown file content, and custom JS filtering then conditionally removes all undesired markup.

Using the raw Markdown instead of the default searchdata loop, which indexes the parsed HTML, took that search index from 3.8 MB down to 990 KB (raw size, before gzipping).
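
As a rough illustration of the kind of filtering described above, here is a minimal JavaScript sketch that reduces raw Markdown to plain text before it goes into a search index. The function name stripMarkdown and the regular expressions are assumptions made for illustration, not the actual script used for that wiki, and they only cover common constructs.

```javascript
// Hypothetical helper: reduce raw Markdown to plain text for a search index.
// This is a handful of regexes, not a real Markdown parser, so unusual
// constructs (nested brackets, literal "<" in prose, etc.) may be mishandled.
function stripMarkdown(source) {
  return source
    .replace(/^---\n[\s\S]*?\n---\n/, "")      // leading YAML front matter
    .replace(/`{3}[\s\S]*?`{3}/g, " ")         // fenced code blocks
    .replace(/<[^>]+>/g, " ")                  // inline HTML tags
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1")  // images -> alt text
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")   // links -> link text
    .replace(/^#{1,6}\s+/gm, "")               // heading markers
    .replace(/[*`~]/g, "")                     // emphasis / inline code markers
    .replace(/\s+/g, " ")                      // collapse whitespace
    .trim();
}

const sample =
  "---\ntitle: Test\n---\n# Heading\n\n" +
  "Some **bold** text with a [link](https://example.com) and `code`.\n";
console.log(stripMarkdown(sample));
```

Underscores are deliberately left alone so that identifiers such as strip_html survive; the trade-off is that `_emphasis_` markers stay in the index.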

chocmake (May 20 '25 14:05)

> I'd also like to mention that (when working as expected as shown in the latter screenshots) the HTML is included in the searchdata content as the referenced commit apparently removed the strip_html Liquid filter.
>
> Tbh I wasn't aware of that filter until seeing this issue but leaving the HTML in the search index leads to unnecessary bloat. It's something I dealt with for our wiki by building the search index via a shell script instead of via Liquid, to obtain the raw Markdown file content then with custom filtering removing all undesired markup via JS conditionally.

This seemed wrong to me at first. I had removed the content field entirely from our searchdata, since it seemed to actually make search worse, so I hadn't looked at the results in a while. But sure enough, post.content | jsonify returns the HTML. The reason I was confused is that page.content | jsonify actually returns the source Markdown, for some reason.

This whole thing seems messy to me in general, so I chose not to include the content field at all anymore. For this to work, you'd have to add the strip_html filter before jsonify for posts, and use markdownify | strip_html before jsonify for pages. The following content filters work, at least, if you're willing to modify your searchdata.js:

Posts:

"content"  : {{ post.content | strip_html | jsonify}}

Pages:

"content"  :  {{ page.content | markdownify | strip_html | jsonify}}

It's hacky and awkward, but it does what it's meant to do. I would also suggest adding a truncate filter so your searchdata doesn't end up literally as big as your whole wiki. Personally, I don't see the point of having the content field there: without a mechanism to display what the search engine matched in the page, it just makes the search results really messy.
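
One possible way to provide the missing "what matched" mechanism is to show a short excerpt around the matched term in each result. The sketch below is a hypothetical helper (matchSnippet and its radius parameter are assumptions, not part of git-wiki-theme) that does a plain case-insensitive substring search; a search library may report match positions itself, in which case this would not be needed.

```javascript
// Hypothetical helper: return a short excerpt of `content` around the first
// case-insensitive occurrence of `query`, or null when there is no match.
// `radius` is the number of characters of context kept on each side.
function matchSnippet(content, query, radius = 40) {
  const index = content.toLowerCase().indexOf(query.toLowerCase());
  if (index === -1) return null;

  const start = Math.max(0, index - radius);
  const end = Math.min(content.length, index + query.length + radius);

  // Mark the excerpt as cut off wherever it does not reach the text boundary.
  const prefix = start > 0 ? "..." : "";
  const suffix = end < content.length ? "..." : "";
  return prefix + content.slice(start, end) + suffix;
}

const text = "The quick brown fox jumps over the lazy dog";
console.log(matchSnippet(text, "FOX", 6)); // prints "...brown fox jumps..."
```

The excerpt could be rendered under each result title, which would make it clearer why a page matched without exposing the whole content field.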

razvii22 (Jun 02 '25 09:06)

> I personally don't see the point in having the content field there as without a mechanism to display what the search engine matched in the page, it ends up making the search results really messy.

Just as an aside, it's worth trying a different search library than git-wiki's default, such as Fuse.js or Lunr.js. I did so with Fuse.js, and it was straightforward to tell it to use the existing keys from the searchdata.js output.

In our case we needed full-text search (and some of the content is presented via JS rather than Jekyll), so keeping the non-truncated content text was useful. If that isn't necessary, trimming it is a good call.

chocmake (Jun 03 '25 13:06)