Support highlighting function in SQL/PPL query engine
Related design for relevance-based search in SQL/PPL engine - https://github.com/opensearch-project/sql/issues/182
TODO List
- [x] Support the highlight function by including it in the search engine
- [x] Enable highlight function in SQL syntax and parser, including parameters
- [ ] Enable highlight function in PPL syntax and parser, including parameters
- [x] Add unit tests
- [x] Add integration tests for highlighting
- [x] Update the user manual - append to existing match search documentation
Function Details
(reference: https://opensearch.org/docs/latest/opensearch/ux/#highlight-query-matches)
The highlight function adds the matched search terms, wrapped in highlight tags, to the service response for each highlighted field. Highlighting only works in tandem with match-search queries: every highlighted field must also be present in a relevance-based search function in the same query, otherwise the query results in a syntax error.
Syntax
highlight([field_list, …][, option=<option_value>]*)
Available Options
- pre_tags: pre-term tag to embed in the highlighted result, default is <em>
- post_tags: post-term tag to embed in the highlighted result, default is </em>
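To make the options concrete, here is a minimal sketch of how a field list plus pre_tags/post_tags could be turned into the "highlight" part of an OpenSearch search request body. The helper name build_highlight_clause is hypothetical and is not taken from the plugin; the sketch only assumes that the tag options are passed through to the highlight clause as lists.

```python
def build_highlight_clause(fields, pre_tags=None, post_tags=None):
    """Build the "highlight" clause of a search request body.

    Hypothetical helper for illustration; not the plugin's implementation.
    `fields` is a list of field names, the tag arguments are optional
    lists of strings. Omitted tags are left to the engine defaults.
    """
    clause = {"fields": {name: {} for name in fields}}
    if pre_tags is not None:
        clause["pre_tags"] = list(pre_tags)
    if post_tags is not None:
        clause["post_tags"] = list(post_tags)
    return clause


# Request body for highlighting "text_entry" with custom tags.
body = {
    "query": {"match": {"text_entry": "life"}},
    "highlight": build_highlight_clause(
        ["text_entry"], pre_tags=["<mark>"], post_tags=["</mark>"]
    ),
}
```

Calling the helper with only a field list produces the same clause as the sample query in this issue, with the default tags left to the engine.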
Sample query
GET shakespeare/_search
{
"query": {
"match": {
"text_entry": "life"
}
},
"highlight": {
"fields": {
"text_entry": {}
}
}
}
SQL
SELECT highlight("text_entry") as "highlight_text_entry" FROM shakespeare WHERE match("text_entry", "life");
PPL
source=shakespeare | where match("text_entry", "life") | highlight("text_entry")
Sample response
| highlight_text_entry |
| "my <em>life</em>, except my <em>life</em>." |
Looking for input on what JSON output the SQL plugin should return for a multi-field highlight query. OpenSearch responds to a multi-field highlight query with the JSON output shown below:
"hits": ... {
"highlight": {
"Field1": [
"highlights <em>hl</em>"
],
"Field2": [
"<p>highlight"
]
}
}
The SQL plugin can handle this response with the nested type, which allows the returned fields to be accessed using the '.' notation. Option #1: should the SQL plugin nest all returned fields under a single column, as shown in the following output?
{
"schema": [
{
"name": "highlight(\"*\")",
"type": "nested"
}
],
"datarows": [
[
{
"Field1": [
"highlights <em>hl</em>"
],
"Field2": [
"<p>highlight"
]
}
],
...
],
}
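As a rough illustration of the single nested column structure above, the following sketch maps a list of OpenSearch hits to that response shape. The function name and the sample hits are hypothetical; the sketch assumes each hit carries an optional "highlight" object keyed by field name.

```python
def highlight_as_nested_column(hits, column_name='highlight("*")'):
    """Put the whole highlight object of each hit into one nested column.

    Hypothetical helper for illustration; not the plugin's implementation.
    Hits without a "highlight" key produce an empty object.
    """
    rows = [[hit.get("highlight", {})] for hit in hits]
    return {
        "schema": [{"name": column_name, "type": "nested"}],
        "datarows": rows,
    }


# Sample hits shaped like the OpenSearch response in this issue.
nested_hits = [
    {"highlight": {"Field1": ["highlights <em>hl</em>"], "Field2": ["<p>highlight"]}},
    {"highlight": {"Field1": ["more <em>hl</em>"]}},
]
nested_result = highlight_as_nested_column(nested_hits)
```

With this shape the number of columns stays fixed at one, regardless of which fields each hit happens to highlight.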
Option #2: alternatively, the SQL plugin could output each returned field as a separate column, similar to a "SELECT * from ..." SQL query. The following JSON output shows multiple returned fields as multiple columns:
{
"schema": [
{
"name": "highlight(\"*\").field1",
"type": "keyword"
},
{
"name": "highlight(\"*\").field2",
"type": "keyword"
}
],
"datarows": [
[
[
"<em>field 1</em> result 1",
"<em>field 2</em> result 2"
]
],
...
],
}
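For comparison, here is a minimal sketch of the one-column-per-field structure above. The function name and sample hits are hypothetical. It assumes each cell holds the list of fragments returned for that field, that a field missing from a hit becomes None, and that every column is typed "keyword" as in the example output.

```python
def highlight_as_columns(hits, prefix='highlight("*")'):
    """Expand highlighted fields into one column per field name.

    Hypothetical helper for illustration; not the plugin's implementation.
    """
    # Collect field names in first-seen order so the column order is stable.
    names = []
    for hit in hits:
        for name in hit.get("highlight", {}):
            if name not in names:
                names.append(name)

    schema = [{"name": f"{prefix}.{name}", "type": "keyword"} for name in names]
    # A field that a hit did not highlight yields None in that row.
    rows = [[hit.get("highlight", {}).get(name) for name in names] for hit in hits]
    return {"schema": schema, "datarows": rows}


column_hits = [
    {"highlight": {"Field1": ["highlights <em>hl</em>"], "Field2": ["<p>highlight"]}},
    {"highlight": {"Field1": ["more <em>hl</em>"]}},
]
column_result = highlight_as_columns(column_hits)
```

Unlike the nested variant, the schema here depends on the data: the set of columns is only known after scanning every hit.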
@penghuo, @dai-chen, @joshuali925 which response structure do you think is more appropriate?
@forestmvey Quick question: If we go for option #2, is it possible there are more than one nested level to flatten? ex. highlight(*).field1.nestField1
@dai-chen I would expect if the data source has fields nested in this way, that this would be possible. Perhaps option #1 would be more predictable in this case.
Yeah, I'm thinking the same. If we go for option #2, I'm not sure about the complexity, or whether we may have to fall back to option #1 in certain cases. If we only support the simple unnested field case, it should be fine.
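To give a feel for the complexity being discussed, here is a small sketch of flattening more than one nested level into dotted column names such as highlight(*).field1.nestField1. It assumes the highlight payload arrives as nested objects, which is an assumption made for illustration and not something this thread confirms; the function name and sample data are hypothetical.

```python
def flatten_highlights(value, prefix):
    """Flatten nested dicts into {dotted_name: leaf_value} pairs.

    Hypothetical helper for illustration; assumes nested dicts as input.
    """
    if not isinstance(value, dict):
        # Leaf reached: the accumulated prefix is the column name.
        return {prefix: value}
    flat = {}
    for key, child in value.items():
        flat.update(flatten_highlights(child, f"{prefix}.{key}"))
    return flat


nested = {"field1": {"nestField1": ["<em>a</em> b"]}, "field2": ["c"]}
columns = flatten_highlights(nested, "highlight(*)")
# columns maps "highlight(*).field1.nestField1" and "highlight(*).field2"
# to their fragment lists.
```

The recursion itself is short, but the resulting column set varies with the depth and shape of each document, which is the predictability concern raised above.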
@acarbonetto The syntax highlight(...), which uses parentheses, is different from the syntax of existing PPL commands. Looking at the docs, the different arguments are usually delimited with spaces. Shouldn't highlight be the same?
@dai-chen @joshuali925 @penghuo Here's a quick demo on usage for highlight in SQL and PPL for https://github.com/opensearch-project/sql/pull/827. (NOTE: highlight in PPL is still undergoing design. Discussions of PPL syntax can be made here)
https://user-images.githubusercontent.com/36905077/193615927-3d9df7cf-ca4f-41ea-a7e9-4c0c65bc8b60.mp4
@forestmvey Thanks for the work! We may post it in discussion as well like what @MaxKsyunz did? https://github.com/opensearch-project/sql/discussions/850
Here I have posted it, thanks for this: #879
Closing this; the only remaining item is tracked in https://github.com/opensearch-project/sql/issues/916.