Update query broken when document contains single nested child documents
Solarium version(s) affected: 6.1.6 Solr version: 8.5.2
Description
Originally posted by @jupevi in https://github.com/solariumphp/solarium/issues/954#issuecomment-1095214385
Pull request https://github.com/solariumphp/solarium/pull/954 broke update queries containing documents with single nested child documents (i.e. associative arrays), because it always treats arrays in field values as sequential and discards their keys.
Relevant lines in the pull request: https://github.com/solariumphp/solarium/pull/954/files#diff-55c0919de6efccffaca331e4cd644e37dd4fd7deae7f17413e21b6320ab36fa3R251-R256
Before this change, this payload worked fine:
{
"id": 1,
"manufacturer": {
"docType": "Manufacturer",
"id": 2,
"name": "Apple"
}
}
After this change, the payload gets changed to the following before sending to Solr:
{
"id": 1,
"manufacturer": [
"Manufacturer",
2,
"Apple"
]
}
How to reproduce
require __DIR__ . '/vendor/autoload.php';
$data = json_decode('{
"id": 1,
"manufacturer": {
"docType": "Manufacturer",
"id": 2,
"name": "Apple"
}
}', true);
$doc = new \Solarium\QueryType\Update\Query\Document($data);
var_dump($doc->getFields());
Output:
array(2) {
["id"]=>
int(1)
["manufacturer"]=>
array(3) {
[0]=>
string(12) "Manufacturer"
[1]=>
int(2)
[2]=>
string(5) "Apple"
}
}
As you can see, the array keys have been discarded.
Before version 6.1.6 the output was correct:
array(2) {
["id"]=>
int(1)
["manufacturer"]=>
array(3) {
["docType"]=>
string(12) "Manufacturer"
["id"]=>
int(2)
["name"]=>
string(5) "Apple"
}
}
Seems like we also need some more test coverage for nested documents.
The obvious solution is moving $filterControlCharacters elsewhere and let the document be a document.
The most user-friendly would be making it an option of the Update query. That makes it easy to set to false if anyone should need to. The logical place for the actual filtering is in the request builder because it's XML specific. The request builder can get the option from the query.
This is taking longer than expected to fix. With the document getting it mostly right, the request builder still gets it wrong.
EDIT: The request builder only got it wrong because I got the document not quite entirely right with my first attempt.
Before this change, this payload worked fine:
@jupevi I'm curious what you mean by "worked fine". You were able to set/get it on Document, but were you also able to have it actually indexed correctly with Solarium?
The behaviour of Document has been fixed in PR #994. However, I haven't been succesful in actually indexing it as a labelled single nested child document. Depending on how I build the XML update command, I end up with an anonymous child or an array of child documents that contains just the one child.
I have asked on the user list: https://lists.apache.org/thread/6qw3k7mjb09r86f3tkqjw3wzlzw7s62q. From a cursory look at the Solr source code, I suspect it might not be possible at all in XML though.
@thomascorthals Thank you for looking into this, and fixing it so swiftly. Much appreciated.
Yes, by "worked fine" I meant it was both indexed and returned by Solr as a named single child document. Indexing an array with a single child document results in Solr also returning the "field" as an array when queried, so there's a clear distinction between the two indexing payloads.
Unfortunately I can't comment on XML indexing as JSON is the only one I've worked with.
@jupevi Unless something turns up on the user list that I've overlooked, I'm afraid I won't be able to get this to work 100% with Solarium (which always uses XML for any update payload).
For now, I'm going to go with indexing a single child as an array of nested documents because at least that keeps the associated label intact.
I might be able to come up with an ugly workaround. I prefer doing that in a separate PR though.
@thomascorthals I was not aware Solarium always uses XML updates. Do note that I've been and am currently indexing a single nested child document using Solarium, so this does work in versions prior to 6.1.6. That would imply that it's definitely possible to index using XML.
I would like to avoid accepting a solution where the single child is forcibly coerced to an array, since that was not the case before. I would need to add separate logic for extracting known single children from query results, which is just unnecessary complexity and maintenane on my part.
@jupevi Against which Solr version?
I was testing against 8.11.
8.5.2 as in the original comment. Could it be that this has been broken in some Solr update?
@jupevi Do you have a working example of this? What's in the schema regarding nested docs? I tried with Solarium 6.1.5 against Solr 8.5.2 (techproducts example with _nest_path_ added to the schema) and I can't get it to index at all.
Have this in the schema:
<!-- points to the root document of a block of nested documents. Required for nested
document support, may be removed otherwise
-->
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<fieldType name="_nest_path_" class="solr.NestPathField" />
<field name="_nest_path_" type="_nest_path_" />
Put this in the examples directory:
<?php
require_once(__DIR__.'/init.php');
$client = new \Solarium\Client($adapter, $eventDispatcher, $config);
$update = $client->createUpdate();
$data = json_decode('{
"id": 1,
"manufacturer": {
"docType": "Manufacturer",
"id": 2,
"name": "Apple"
}
}', true);
$doc = new \Solarium\QueryType\Update\Query\Document($data);
var_dump($doc->getFields());
$update->addDocument($doc);
$client->update($update);
Got this back:
array(2) {
["id"]=>
int(1)
["manufacturer"]=>
array(3) {
["docType"]=>
string(12) "Manufacturer"
["id"]=>
int(2)
["name"]=>
string(5) "Apple"
}
}
PHP Fatal error: Uncaught Solarium\Exception\HttpException: Solr HTTP error: OK (400)
{
"responseHeader":{
"status":400,
"QTime":0},
"error":{
"metadata":[
"error-class","org.apache.solr.common.SolrException",
"root-error-class","org.apache.solr.common.SolrException"],
"msg":"ERROR: [doc=1] unknown field 'manufacturer'",
"code":400}}
in C:\repos\solarium\src\Core\Query\Result\Result.php:73
Stack trace:
#0 C:\repos\solarium\src\Core\Client\Client.php(771): Solarium\Core\Query\Result\Result->__construct(Object(Solarium\QueryType\Update\Query\Query), Object(Solarium\Core\Client\Response))
#1 C:\repos\solarium\src\Core\Client\Client.php(801): Solarium\Core\Client\Client->createResult(Object(Solarium\QueryType\Update\Query\Query), Object(Solarium\Core\Client\Response))
#2 C:\repos\solarium\src\Core\Client\Client.php(882): Solarium\Core\Client\Client->execute(Object(Solarium\QueryType\Update\Query\Query), NULL)
#3 C:\repos\solarium\examples\nest.php(20): Solarium\Core\Client\Client->update(Object(Solarium\QueryType\Update\Query\Query))
#4 {main}
thrown in C:\repos\solarium\src\Core\Query\Result\Result.php on line 73
@thomascorthals Same fields in my schema, except I also have the _nest_parent_ field.
<fieldType name="nest_path" class="solr.NestPathField" />
<!-- points to the root document of a block of nested documents. Required for nested
document support, may be removed otherwise
-->
<field name="_root_" type="string" indexed="true" stored="false" />
<!-- _nest_path_ is populated by Solr automatically with the path of the document in the hierarchy
for non-root documents. This field is optional.
-->
<field name="_nest_path_" type="nest_path" />
<!-- _nest_parent_ is populated by Solr automatically to store the ID of each document’s
parent document (if there is one). This field is optional.
-->
<field name="_nest_parent_" type="string" indexed="true" stored="true" />
Maybe see if that makes a difference? Probably not since it's optional.
I will try to hack up a working example at some point if I have some free time.
@jupevi _nest_parent_ doesn't make a difference for me.
@thomascorthals I tried making a minimal reproduction and ran into the same issue you described in https://github.com/solariumphp/solarium/issues/992#issuecomment-1114559038
I was a bit stumped, until I found out that in 2020 I had actually created a custom query type object, JsonUpdateQuery, which adds support for making JSON updates using Solarium, instead of using XML as you said. Of course, in typical "past me" fashion, I didn't exactly elaborate on why I added this query type in the commit message it was added in, but it might have to do with your finding that XML indexing doesn't support single nested children. Yet another lesson in the endless quest to learn the importance of good commit messages.
Basically, I have created two new classes, JsonUpdateQuery extends Solarium\QueryType\Update\Query\Query and JsonUpdateRequestBuilder extends Solarium\Core\Query\AbstractRequestBuilder. Then after constructing Client I call $client->registerQueryType(Client::QUERY_UPDATE, JsonUpdateQuery::class); to use my custom classes. After adding these classes to my reproduction, indexing works and querying returns the expected result:
/Users/jukka/github/solr-single-document-index-demo/index.php:34:
array(3) {
'id' =>
int(1)
'docType' =>
string(7) "Product"
'manufacturer' =>
array(3) {
'docType' =>
string(12) "Manufacturer"
'id' =>
int(2)
'name' =>
string(5) "Apple"
}
}
/Users/jukka/github/solr-single-document-index-demo/index.php:45:
array(1) {
[0] =>
class Solarium\QueryType\Select\Result\Document#20 (1) {
protected $fields =>
array(3) {
'id' =>
string(1) "1"
'docType' =>
string(7) "Product"
'manufacturer' =>
array(4) {
'docType' =>
string(12) "Manufacturer"
'id' =>
string(1) "2"
'name' =>
string(5) "Apple"
'_nest_parent_' =>
string(1) "1"
}
}
}
}
You can find my reproduction (including the two classes described above) here: https://github.com/jupevi/solr-single-document-index-demo
Sorry about wasting your time by forgetting this crucial detail. I did not have the time to take a detailed look at your proposed solution to this problem, but do you think it's compatible with this approach? I guess the question is, did you forcibly coerce associative arrays into a single element sequential array in your approach or are you letting them past as is?
@jupevi There are actually two problems here.
#994 fixes Document. It is now fully aware of the distinction between regular values and nested documents. Correct setting and getting of single and multiple children is covered 100% in the unit tests. It no longer has (and should never have needed) knowledge of how the request builder will deal with this data.
I've raised SOLR-16183 for the XML limitation and listed it as a known limitation in our docs for now. I've made the request builder forward compatible with my proposal for the XML syntax. As a result a single child will currently be indexed anonymously. Coercing it to an array would set it up for a future BC break if Solr ever implements that syntax. I believe my argumentation is irrefutable. And it's at least an "official source" to point to as long as they don't dismiss the issue.
A JSON based approach will be the only way to get this to work. Didn't want to tackle this in a single PR.
@jupevi Solarium 6.2.5 has been released with the fix for the actual bug in the way Document handled single nested children.
This release also includes important groundwork for JSON updates. It implements JsonSerializable on Documents with support for atomic updates.
we experience breakage between 6.2.3 and 6.2.6
found this issue :)
Solr 7, our logs are screaming 400 bad request:
Remote error message: Unable to index docs with children:
the schema must include definitions for both a uniqueKey field and the '_root_' field, using the exact same fieldType
@ro0NL Do you have a reproducible example that you can share?
Sorry about the late reply, I've been busy with other projects.
Thank you so much for providing the fix, I have validated that 6.2.6 works for our use case.
Too bad about the side effect reported in https://github.com/solariumphp/solarium/issues/992#issuecomment-1247666653. Hard to guess what the issue could be without an example though.
Should this issue be closed and a new one created for that?
we've fixed the version to 6.2.3 for now, feel free to close
this url is in our git log :)
@jupevi Solarium 6.2.8 was released with native support for JSON formatted update requests. For now it's opt-in to allow users to test it out, it'll become the default with the next major release.
Do you mind test-driving it for nested documents?
$update = $client->createUpdate();
$update->setRequestFormat($update::REQUEST_FORMAT_JSON);
@thomascorthals Sorry for the very late reply, I haven't had a chance to work on this for a while.
Yesterday, I updated to 6.3.1 and replaced my jury-rigged json-update query type with your suggestion:
$update = $client->createUpdate();
$update->setRequestFormat($update::REQUEST_FORMAT_JSON);
So far, testing this in my local development environment and our sandbox environment I haven't found any issues. For our use-case, JSON indexing works correctly, including nested documents.
Thank you for your work!
@jupevi
Yesterday, I updated to 6.3.1
You don't even have to set it explicitly. Solarium 6.3 uses the JSON request format by default.