solarium icon indicating copy to clipboard operation
solarium copied to clipboard

Update query broken when document contains single nested child documents

Open jupevi opened this issue 3 years ago • 19 comments

Solarium version(s) affected: 6.1.6 Solr version: 8.5.2

Description

Originally posted by @jupevi in https://github.com/solariumphp/solarium/issues/954#issuecomment-1095214385

Pull request https://github.com/solariumphp/solarium/pull/954 broke update queries containing documents with single nested child documents (i.e. associative arrays), because it always treats arrays in field values as sequential and discards their keys.

Relevant lines in the pull request: https://github.com/solariumphp/solarium/pull/954/files#diff-55c0919de6efccffaca331e4cd644e37dd4fd7deae7f17413e21b6320ab36fa3R251-R256

Before this change, this payload worked fine:

{
  "id": 1,
  "manufacturer": {
    "docType": "Manufacturer",
    "id": 2,
    "name": "Apple"
  }
}

After this change, the payload gets changed to the following before sending to Solr:

{
  "id": 1,
  "manufacturer": [
    "Manufacturer",
    2,
    "Apple"
  ]
}

How to reproduce

require __DIR__ . '/vendor/autoload.php';
$data = json_decode('{
    "id": 1,
    "manufacturer": {
        "docType": "Manufacturer",
        "id": 2,
        "name": "Apple"
    }
}', true);
$doc = new \Solarium\QueryType\Update\Query\Document($data);
var_dump($doc->getFields());

Output:

array(2) {
  ["id"]=>
  int(1)
  ["manufacturer"]=>
  array(3) {
    [0]=>
    string(12) "Manufacturer"
    [1]=>
    int(2)
    [2]=>
    string(5) "Apple"
  }
}

As you can see, the array keys have been discarded.

Before version 6.1.6 the output was correct:

array(2) {
  ["id"]=>
  int(1)
  ["manufacturer"]=>
  array(3) {
    ["docType"]=>
    string(12) "Manufacturer"
    ["id"]=>
    int(2)
    ["name"]=>
    string(5) "Apple"
  }
}

jupevi avatar Apr 12 '22 08:04 jupevi

Seems like we also need some more test coverage for nested documents.

mkalkbrenner avatar Apr 13 '22 10:04 mkalkbrenner

The obvious solution is moving $filterControlCharacters elsewhere and let the document be a document.

The most user-friendly would be making it an option of the Update query. That makes it easy to set to false if anyone should need to. The logical place for the actual filtering is in the request builder because it's XML specific. The request builder can get the option from the query.

thomascorthals avatar Apr 13 '22 16:04 thomascorthals

This is taking longer than expected to fix. With the document getting it mostly right, the request builder still gets it wrong.

EDIT: The request builder only got it wrong because I got the document not quite entirely right with my first attempt.

thomascorthals avatar Apr 26 '22 09:04 thomascorthals

Before this change, this payload worked fine:

@jupevi I'm curious what you mean by "worked fine". You were able to set/get it on Document, but were you also able to have it actually indexed correctly with Solarium?

The behaviour of Document has been fixed in PR #994. However, I haven't been succesful in actually indexing it as a labelled single nested child document. Depending on how I build the XML update command, I end up with an anonymous child or an array of child documents that contains just the one child.

I have asked on the user list: https://lists.apache.org/thread/6qw3k7mjb09r86f3tkqjw3wzlzw7s62q. From a cursory look at the Solr source code, I suspect it might not be possible at all in XML though.

thomascorthals avatar May 01 '22 15:05 thomascorthals

@thomascorthals Thank you for looking into this, and fixing it so swiftly. Much appreciated.

Yes, by "worked fine" I meant it was both indexed and returned by Solr as a named single child document. Indexing an array with a single child document results in Solr also returning the "field" as an array when queried, so there's a clear distinction between the two indexing payloads.

Unfortunately I can't comment on XML indexing as JSON is the only one I've worked with.

jupevi avatar May 01 '22 17:05 jupevi

@jupevi Unless something turns up on the user list that I've overlooked, I'm afraid I won't be able to get this to work 100% with Solarium (which always uses XML for any update payload).

For now, I'm going to go with indexing a single child as an array of nested documents because at least that keeps the associated label intact.

I might be able to come up with an ugly workaround. I prefer doing that in a separate PR though.

thomascorthals avatar May 01 '22 19:05 thomascorthals

@thomascorthals I was not aware Solarium always uses XML updates. Do note that I've been and am currently indexing a single nested child document using Solarium, so this does work in versions prior to 6.1.6. That would imply that it's definitely possible to index using XML.

I would like to avoid accepting a solution where the single child is forcibly coerced to an array, since that was not the case before. I would need to add separate logic for extracting known single children from query results, which is just unnecessary complexity and maintenane on my part.

jupevi avatar May 01 '22 19:05 jupevi

@jupevi Against which Solr version?

I was testing against 8.11.

thomascorthals avatar May 01 '22 20:05 thomascorthals

8.5.2 as in the original comment. Could it be that this has been broken in some Solr update?

jupevi avatar May 01 '22 20:05 jupevi

@jupevi Do you have a working example of this? What's in the schema regarding nested docs? I tried with Solarium 6.1.5 against Solr 8.5.2 (techproducts example with _nest_path_ added to the schema) and I can't get it to index at all.

Have this in the schema:

   <!-- points to the root document of a block of nested documents. Required for nested
      document support, may be removed otherwise
   -->
   <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
   <fieldType name="_nest_path_" class="solr.NestPathField" />
   <field name="_nest_path_" type="_nest_path_" />

Put this in the examples directory:

<?php

require_once(__DIR__.'/init.php');

$client = new \Solarium\Client($adapter, $eventDispatcher, $config);
$update = $client->createUpdate();

$data = json_decode('{
    "id": 1,
    "manufacturer": {
        "docType": "Manufacturer",
        "id": 2,
        "name": "Apple"
    }
}', true);
$doc = new \Solarium\QueryType\Update\Query\Document($data);
var_dump($doc->getFields());

$update->addDocument($doc);
$client->update($update);

Got this back:

array(2) {
  ["id"]=>
  int(1)
  ["manufacturer"]=>
  array(3) {
    ["docType"]=>
    string(12) "Manufacturer"
    ["id"]=>
    int(2)
    ["name"]=>
    string(5) "Apple"
  }
}
PHP Fatal error:  Uncaught Solarium\Exception\HttpException: Solr HTTP error: OK (400)
{
  "responseHeader":{
    "status":400,
    "QTime":0},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"ERROR: [doc=1] unknown field 'manufacturer'",
    "code":400}}
 in C:\repos\solarium\src\Core\Query\Result\Result.php:73
Stack trace:
#0 C:\repos\solarium\src\Core\Client\Client.php(771): Solarium\Core\Query\Result\Result->__construct(Object(Solarium\QueryType\Update\Query\Query), Object(Solarium\Core\Client\Response))
#1 C:\repos\solarium\src\Core\Client\Client.php(801): Solarium\Core\Client\Client->createResult(Object(Solarium\QueryType\Update\Query\Query), Object(Solarium\Core\Client\Response))
#2 C:\repos\solarium\src\Core\Client\Client.php(882): Solarium\Core\Client\Client->execute(Object(Solarium\QueryType\Update\Query\Query), NULL)
#3 C:\repos\solarium\examples\nest.php(20): Solarium\Core\Client\Client->update(Object(Solarium\QueryType\Update\Query\Query))
#4 {main}
  thrown in C:\repos\solarium\src\Core\Query\Result\Result.php on line 73

thomascorthals avatar May 02 '22 07:05 thomascorthals

@thomascorthals Same fields in my schema, except I also have the _nest_parent_ field.

  <fieldType name="nest_path" class="solr.NestPathField" />
  <!-- points to the root document of a block of nested documents. Required for nested
     document support, may be removed otherwise
  -->
  <field name="_root_" type="string" indexed="true" stored="false" />
  <!-- _nest_path_ is populated by Solr automatically with the path of the document in the hierarchy
     for non-root documents. This field is optional.
  -->
  <field name="_nest_path_" type="nest_path" />
  <!-- _nest_parent_ is populated by Solr automatically to store the ID of each document’s
     parent document (if there is one). This field is optional.
  -->
  <field name="_nest_parent_" type="string" indexed="true" stored="true" />

Maybe see if that makes a difference? Probably not since it's optional.

I will try to hack up a working example at some point if I have some free time.

jupevi avatar May 02 '22 10:05 jupevi

@jupevi _nest_parent_ doesn't make a difference for me.

thomascorthals avatar May 02 '22 11:05 thomascorthals

@thomascorthals I tried making a minimal reproduction and ran into the same issue you described in https://github.com/solariumphp/solarium/issues/992#issuecomment-1114559038

I was a bit stumped, until I found out that in 2020 I had actually created a custom query type object, JsonUpdateQuery, which adds support for making JSON updates using Solarium, instead of using XML as you said. Of course, in typical "past me" fashion, I didn't exactly elaborate on why I added this query type in the commit message it was added in, but it might have to do with your finding that XML indexing doesn't support single nested children. Yet another lesson in the endless quest to learn the importance of good commit messages.

Basically, I have created two new classes, JsonUpdateQuery extends Solarium\QueryType\Update\Query\Query and JsonUpdateRequestBuilder extends Solarium\Core\Query\AbstractRequestBuilder. Then after constructing Client I call $client->registerQueryType(Client::QUERY_UPDATE, JsonUpdateQuery::class); to use my custom classes. After adding these classes to my reproduction, indexing works and querying returns the expected result:

/Users/jukka/github/solr-single-document-index-demo/index.php:34:
array(3) {
  'id' =>
  int(1)
  'docType' =>
  string(7) "Product"
  'manufacturer' =>
  array(3) {
    'docType' =>
    string(12) "Manufacturer"
    'id' =>
    int(2)
    'name' =>
    string(5) "Apple"
  }
}
/Users/jukka/github/solr-single-document-index-demo/index.php:45:
array(1) {
  [0] =>
  class Solarium\QueryType\Select\Result\Document#20 (1) {
    protected $fields =>
    array(3) {
      'id' =>
      string(1) "1"
      'docType' =>
      string(7) "Product"
      'manufacturer' =>
      array(4) {
        'docType' =>
        string(12) "Manufacturer"
        'id' =>
        string(1) "2"
        'name' =>
        string(5) "Apple"
        '_nest_parent_' =>
        string(1) "1"
      }
    }
  }
}

You can find my reproduction (including the two classes described above) here: https://github.com/jupevi/solr-single-document-index-demo

Sorry about wasting your time by forgetting this crucial detail. I did not have the time to take a detailed look at your proposed solution to this problem, but do you think it's compatible with this approach? I guess the question is, did you forcibly coerce associative arrays into a single element sequential array in your approach or are you letting them past as is?

jupevi avatar May 03 '22 22:05 jupevi

@jupevi There are actually two problems here.

#994 fixes Document. It is now fully aware of the distinction between regular values and nested documents. Correct setting and getting of single and multiple children is covered 100% in the unit tests. It no longer has (and should never have needed) knowledge of how the request builder will deal with this data.

I've raised SOLR-16183 for the XML limitation and listed it as a known limitation in our docs for now. I've made the request builder forward compatible with my proposal for the XML syntax. As a result a single child will currently be indexed anonymously. Coercing it to an array would set it up for a future BC break if Solr ever implements that syntax. I believe my argumentation is irrefutable. And it's at least an "official source" to point to as long as they don't dismiss the issue.

A JSON based approach will be the only way to get this to work. Didn't want to tackle this in a single PR.

thomascorthals avatar May 09 '22 14:05 thomascorthals

@jupevi Solarium 6.2.5 has been released with the fix for the actual bug in the way Document handled single nested children.

This release also includes important groundwork for JSON updates. It implements JsonSerializable on Documents with support for atomic updates.

thomascorthals avatar Jul 20 '22 12:07 thomascorthals

we experience breakage between 6.2.3 and 6.2.6

found this issue :)

Solr 7, our logs are screaming 400 bad request:

Remote error message: Unable to index docs with children:
the schema must include definitions for both a uniqueKey field and the '_root_' field, using the exact same fieldType

ro0NL avatar Sep 15 '22 07:09 ro0NL

@ro0NL Do you have a reproducible example that you can share?

thomascorthals avatar Sep 15 '22 08:09 thomascorthals

Sorry about the late reply, I've been busy with other projects.

Thank you so much for providing the fix, I have validated that 6.2.6 works for our use case.

Too bad about the side effect reported in https://github.com/solariumphp/solarium/issues/992#issuecomment-1247666653. Hard to guess what the issue could be without an example though.

Should this issue be closed and a new one created for that?

jupevi avatar Sep 15 '22 12:09 jupevi

we've fixed the version to 6.2.3 for now, feel free to close

this url is in our git log :)

ro0NL avatar Sep 15 '22 12:09 ro0NL

@jupevi Solarium 6.2.8 was released with native support for JSON formatted update requests. For now it's opt-in to allow users to test it out, it'll become the default with the next major release.

Do you mind test-driving it for nested documents?

$update = $client->createUpdate();
$update->setRequestFormat($update::REQUEST_FORMAT_JSON);

thomascorthals avatar Dec 06 '22 11:12 thomascorthals

@thomascorthals Sorry for the very late reply, I haven't had a chance to work on this for a while.

Yesterday, I updated to 6.3.1 and replaced my jury-rigged json-update query type with your suggestion:

$update = $client->createUpdate();
$update->setRequestFormat($update::REQUEST_FORMAT_JSON);

So far, testing this in my local development environment and our sandbox environment I haven't found any issues. For our use-case, JSON indexing works correctly, including nested documents.

Thank you for your work!

jupevi avatar Jul 12 '23 13:07 jupevi

@jupevi

Yesterday, I updated to 6.3.1

You don't even have to set it explicitly. Solarium 6.3 uses the JSON request format by default.

thomascorthals avatar Jul 12 '23 13:07 thomascorthals