zhparser

Missing word segmentation results

Open day210 opened this issue 2 years ago • 2 comments

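I'm not sure whether this is caused by stop words being ignored or by something else, but results go missing. For example, the segmentation result for "批量处理" contains only "处理", and the result for "自动提交" contains only "提交". The same inputs segment correctly on the web page for debugging segmentation results.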

day210, Jun 19 '23 05:06

I ran into the same problem. For example, the input: select * from to_tsvector('testzhcfg', '我将沉入超凡的黑暗') returned: '沉入':1 '黑暗':2 By rights, "超凡" is an adjective and should not be filtered out as a stop word.

I tried running the debug command: ts_debug('testzhcfg', '我将沉入超凡的黑暗') (screenshot of the output)

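Note the row for the word '超凡' in the result: it is treated as an unknown word, with alias x, and its dictionary column is empty, meaning no usable dictionary matched. Now look at the configuration command given in the example: ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple It maps only six token types, n,v,a,i,e,l, to the simple dictionary. It does not include x, so tokens of type x find no usable dictionary during processing and are dropped, just as a stop word would be. You can run select ts_token_type('zhparser') to see which token types exist.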

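In other words, adding x to the ADD MAPPING command solves the problem. The type is not always x, though: use ts_debug to confirm the type of the word that was dropped, then add that type to the mapping. This may let in words you don't want, and I'm not sure what problems that might cause.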
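The workflow described above could look roughly like the following. This is only a sketch: it assumes the configuration is named testzhcfg as in the README example, and that the dropped token turns out to be of type x, which should be confirmed with ts_debug first.

```sql
-- 1. List every token type the zhparser parser can emit.
SELECT * FROM ts_token_type('zhparser');

-- 2. Show the tokens whose type has no dictionary mapped to it;
--    these are the ones that never reach the tsvector.
SELECT alias, description, token
FROM ts_debug('testzhcfg', '我将沉入超凡的黑暗')
WHERE cardinality(dictionaries) = 0;

-- 3. Map the missing type to the simple dictionary.
--    "x" is an assumption here; use whatever alias step 2 reports.
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR x WITH simple;

-- 4. Re-run the original query to check that the word now appears.
SELECT to_tsvector('testzhcfg', '我将沉入超凡的黑暗');
```

ADD MAPPING raises an error if the token type already has a mapping, whereas ALTER MAPPING FOR replaces whatever is there. Step 2 also lists types such as punctuation, which you may prefer to leave unmapped.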

IZ-ONE, Aug 26 '24 09:08

@IZ-ONE works for me

immich=# select *
from ts_debug('Baidu百度');
 alias |    description     | token | dictionaries | dictionary | lexemes 
-------+--------------------+-------+--------------+------------+---------
 e     | exclamation,感叹词 | Baidu | {simple}     | simple     | {baidu}
 m     | numeral,数词       | 百度  | {}           |            | 
(2 rows)

immich=# SELECT to_tsvector('zhcfg', 'Baidu百度');
 to_tsvector 
-------------
 'baidu':1
(1 row)

select to_tsvector('chinese', 'Baidu百度') @@ to_tsquery('chinese', '百度'); -- >> false <<
drop text search configuration zhcfg cascade;
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION chinese ALTER MAPPING FOR n,v,a,i,e,l,m,x WITH simple;
set default_text_search_config=chinese;
select to_tsvector('chinese', 'Baidu百度') @@ to_tsquery('chinese', '百度'); -- >> true <<

In my case, I ended up using websearch_to_tsquery to turn plain, unformatted text into a valid query.
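As a small illustration of why websearch_to_tsquery suits raw user input, the sketch below reuses the chinese configuration created above. The search string is made up for the example, and the function requires PostgreSQL 11 or later.

```sql
-- websearch_to_tsquery accepts free-form text; to_tsquery would raise
-- a syntax error on the unquoted space in this input.
SELECT websearch_to_tsquery('chinese', 'Baidu 百度');

-- Match a document against the query, using the same configuration
-- on both sides so that both are tokenized the same way.
SELECT to_tsvector('chinese', 'Baidu百度')
       @@ websearch_to_tsquery('chinese', 'Baidu 百度');
```

When the configuration argument is omitted, as in the query below, both functions fall back to default_text_search_config.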


-- SearchRepository.searchFuse
with
  "ranked_assets" as (
    select
      "assets".*,
      COALESCE(
        ts_rank(
          to_tsvector(f_unaccent (ocr_info.text)),
          websearch_to_tsquery(f_unaccent ($1))
        ),
        0
      ) as "text_rank",
      1 - (smart_search.embedding <=> $2) as "vector_rank"
    from
      "assets"
      inner join "exif" on "assets"."id" = "exif"."assetId"
      left join "smart_search" on "assets"."id" = "smart_search"."assetId"
      left join "ocr_info" on "assets"."id" = "ocr_info"."assetId"
    where
      "assets"."fileCreatedAt" >= $3
      and "exif"."lensModel" = $4
      and "assets"."ownerId" = any ($5::uuid[])
      and "assets"."isFavorite" = $6
      and "assets"."isArchived" = $7
      and "assets"."deletedAt" is null
  )
select
  "ranked_assets".*,
  (0.4 * vector_rank + 0.6 * text_rank) as "combined_rank"
from
  "ranked_assets"
order by
  "combined_rank" desc
limit
  $8
offset
  $9

eric-gitta-moore, May 07 '25 14:05