Elastica

Aggregated data can only be explored as array

Open defgenx opened this issue 11 years ago • 7 comments

I recently used aggregations with Elastica. When I call getAggregation('my_aggregation'), I get the first-level aggregation, but it can only be explored as an array, which is not very friendly or clean. Also, I cannot get the nested aggregations, and that's a real problem.

I think it would be a plus if the developer could get each aggregation as an Elastica\Result object to work with. Is there any work in progress on aggregations?

defgenx avatar Mar 27 '15 12:03 defgenx

@Defgenx Currently the implementation of aggregations in the result object is still very rudimentary. I agree that there should be a more sophisticated way to use them in the result object. But as far as I know nobody is working on this currently. Interested in picking it up?

ruflin avatar Apr 06 '15 23:04 ruflin

@ruflin I know that aggregations are still new in ES and that it's not easy to choose the right approach while they may still change. I made a little helper for my company, but it's not something that can be pushed into the lib as it is. I don't have much free time, but I'll try to do something :).

defgenx avatar Apr 07 '15 13:04 defgenx

@Defgenx Sounds good. Looking forward to it.

ruflin avatar Apr 11 '15 04:04 ruflin

I think the problem here is not the fact that the result is an array, but that it is nested. Even worse, the nesting structure is full of "meta keys". They greatly expand the amount of code required to process aggregation results.

Example with 4 nested aggregations:

date histogram (multi-bucket)
    terms (multi-bucket)
        sum (single-value metric)
        avg (single-value metric)

$query = [
    // ...
    "aggs" => [
        "name_of_date_histogram_agg" => [
            "date_histogram" => [
                "field" => "date",
                "interval" => "day",
            ],
            // Sub-aggregations are siblings of the aggregation type, not children of it
            "aggs" => [
                "name_of_terms_agg" => [
                    "terms" => [
                        "field" => "gender",
                    ],
                    "aggs" => [
                        "name_of_sum_agg" => [
                            "sum" => [
                                "field" => "change",
                            ],
                        ],
                        "name_of_avg_agg" => [
                            "avg" => [
                                "field" => "grade",
                            ],
                        ],
                    ],
                ],
            ],
        ],
    ],
];

$result = [
    // ...
    'aggregations' => [
        "name_of_date_histogram_agg" => [
            "buckets" => [
                [
                    "key_as_string" => "2013-02-02",
                    "key" => 1328140800000,
                    "doc_count" => 20,
                    "aggregations" => [
                        "name_of_terms_agg" => [
                            "buckets" => [
                                [
                                    "key" => "male",
                                    "doc_count" => 10,
                                    "aggregations" => [
                                        "name_of_sum_agg" => [
                                            "value"=>  13
                                        ],
                                        "name_of_avg_agg" => [
                                            "value"=>  1
                                        ]
                                    ]
                                ],
                                [
                                    "key" => "female",
                                    "doc_count" => 10,
                                    "aggregations" => [
                                        "name_of_sum_agg" => [
                                            "value"=>  2.18
                                        ],
                                        "name_of_avg_agg" => [
                                            "value"=>  2
                                        ]
                                    ]
                                ],
                            ]
                        ]
                    ]
                ],
                [
                    "key_as_string" => "2013-03-02",
                    "key" => 1330646400000,
                    "doc_count" => 11,
                    "aggregations" => [
                        "name_of_terms_agg" => [
                            "buckets" => [
                                [
                                    "key" => "male",
                                    "doc_count" => 5,
                                    "aggregations" => [
                                        "name_of_sum_agg" => [
                                            "value"=> 45
                                        ],
                                        "name_of_avg_agg" => [
                                            "value"=>  3
                                        ]
                                    ]
                                ],
                                [
                                    "key" => "female",
                                    "doc_count" => 6,
                                    "aggregations" => [
                                        "name_of_sum_agg" => [
                                            "value"=>  100.13
                                        ],
                                        "name_of_avg_agg" => [
                                            "value"=>  4
                                        ]
                                    ]
                                ]
                            ]
                        ]
                    ]
                ]
            ]
        ]
    ]
];

To get a single value out of that hell you end up with something like:

$result['aggregations']['name_of_date_histogram_agg']['buckets'][0]['aggregations']['name_of_terms_agg']['buckets'][0]['aggregations']['name_of_sum_agg']['value'] == 13

It is horrible... And I think no one can write this without at least 2 typos at a time. Now if you wrap every sub-aggregation into an AggregationResult class with AggregationResult::getBucket($key) and AggregationResult::getAggregation($name), the amount of code needed is basically the same:

$result
    ->getAggregation('name_of_date_histogram_agg')
        ->getBucket(0)
            ->getAggregation('name_of_terms_agg')
                ->getBucket(0)
                    ->getAggregation('name_of_sum_agg')
                        ->getValue() == 13;

So it may be formatted better, and the IDE might give you autocompletion, but in the end there is no real benefit.

A better solution would be to resolve that nested structure into a table format (like in MySQL when you have multiple columns in a GROUP BY clause). In my experience this is actually the main use case for aggregations: transforming the result into a table format to plot the data.
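For reference, the wrapper discussed above could look roughly like the following. This is only a minimal sketch: AggregationResult is a hypothetical class, not part of Elastica, and it assumes the array layout of the $result example above, where every bucket keeps its sub-aggregations under an 'aggregations' key. The aggregation names in the usage part are made up.

```php
<?php

// Hypothetical wrapper, not an Elastica class. Assumes each bucket stores its
// sub-aggregations under an 'aggregations' key, as in the $result example.
final class AggregationResult
{
    /** @var array */
    private $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    /** Returns the bucket at the given position (or key, for keyed buckets). */
    public function getBucket($key): self
    {
        if (!isset($this->data['buckets'][$key])) {
            throw new OutOfBoundsException("No bucket '$key'");
        }

        return new self($this->data['buckets'][$key]);
    }

    /** Returns the named (sub-)aggregation of a response or bucket. */
    public function getAggregation(string $name): self
    {
        if (!isset($this->data['aggregations'][$name])) {
            throw new OutOfBoundsException("No aggregation '$name'");
        }

        return new self($this->data['aggregations'][$name]);
    }

    /** Returns the value of a single-value aggregation such as sum or avg. */
    public function getValue()
    {
        return $this->data['value'] ?? null;
    }
}

// Usage with a tiny, made-up response
$response = [
    'aggregations' => [
        'by_day' => [
            'buckets' => [
                [
                    'key' => 1328140800000,
                    'doc_count' => 20,
                    'aggregations' => [
                        'total' => ['value' => 13],
                    ],
                ],
            ],
        ],
    ],
];

$wrapped = new AggregationResult($response);
$total = $wrapped->getAggregation('by_day')->getBucket(0)->getAggregation('total')->getValue();
```

Throwing on a missing bucket or aggregation is a design choice of this sketch; it turns a typo in a name into an immediate exception instead of an undefined-index notice further down the chain.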

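| name_of_date_histogram_agg | name_of_terms_agg | name_of_sum_agg | name_of_avg_agg |
|----------------------------|-------------------|-----------------|-----------------|
| 1328140800000              | male              | 13              | 1               |
| 1328140800000              | female            | 2.18            | 2               |
| 1330646400000              | male              | 45              | 3               |
| 1330646400000              | female            | 100.13          | 4               |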

That is not complete, because the result contains additional information per key. It is very hard to map that to a relational table: a table holds only one value per column, but Elasticsearch returns several (e.g. one date histogram bucket has "key", "key_as_string" and "doc_count"). So the resulting table (or call it the "flattened result") would look like:

$table = [
    [
        "name_of_date_histogram_agg" => [
            "key" => 1328140800000,
            "key_as_string" => "2013-02-02",
            "doc_count" => 20
        ],
        "name_of_terms_agg" => [
            "key" => "male",
            "doc_count" => 10,
        ],
        "name_of_sum_agg" => 13,
        "name_of_avg_agg" => 1
    ],
    [
        "name_of_date_histogram_agg" => [
            "key" => 1328140800000,
            "key_as_string" => "2013-02-02",
            "doc_count" => 20
        ],
        "name_of_terms_agg" => [
            "key" => "female",
            "doc_count" => 10,
        ],
        "name_of_sum_agg" => 2.18,
        "name_of_avg_agg" => 2
    ],
    [
        "name_of_date_histogram_agg" => [
            "key" => 1330646400000,
            "key_as_string" => "2013-03-02",
            "doc_count" => 11
        ],
        "name_of_terms_agg" => [
            "key" => "male",
            "doc_count" => 5,
        ],
        "name_of_sum_agg" => 45,
        "name_of_avg_agg" => 3
    ],
    [
        "name_of_date_histogram_agg" => [
            "key" => 1330646400000,
            "key_as_string" => "2013-03-02",
            "doc_count" => 11
        ],
        "name_of_terms_agg" => [
            "key" => "female",
            "doc_count" => 6,
        ],
        "name_of_sum_agg" => 100.13,
        "name_of_avg_agg" => 4
    ]
];

What do you have to write to fetch the same result as in the other two examples above?

$table[0]['name_of_sum_agg'] == 13;

Downsides:

  • $table has a lot of duplicate data
  • Implementation requires some complicated recursion and might be slow on very large, deeply nested results
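
The recursion mentioned above could be sketched roughly as follows. This is an untested-against-a-cluster illustration, not a proposed implementation: flattenAggregations is a made-up name, the aggregation names in the usage part are invented, and the function assumes the layout of the $result example, with sub-aggregations under an 'aggregations' key inside each bucket. Sibling bucket aggregations on the same level simply contribute separate rows.

```php
<?php

/**
 * Flattens nested aggregation results into rows, one row per leaf bucket.
 * Hypothetical helper; assumes sub-aggregations live under 'aggregations'.
 */
function flattenAggregations(array $aggregations, array $row = []): array
{
    $bucketAggs = [];
    foreach ($aggregations as $name => $aggregation) {
        if (isset($aggregation['buckets'])) {
            $bucketAggs[$name] = $aggregation;
        } elseif (array_key_exists('value', $aggregation)) {
            // Single-value aggregation (sum, avg, ...): one column of the current row
            $row[$name] = $aggregation['value'];
        }
    }

    // No bucket aggregation on this level: the row is complete
    if ($bucketAggs === []) {
        return [$row];
    }

    $rows = [];
    foreach ($bucketAggs as $name => $aggregation) {
        foreach ($aggregation['buckets'] as $bucket) {
            $children = $bucket['aggregations'] ?? [];
            unset($bucket['aggregations']);
            // What is left of the bucket (key, doc_count, ...) becomes the column value
            $childRow = $row + [$name => $bucket];
            foreach (flattenAggregations($children, $childRow) as $flat) {
                $rows[] = $flat;
            }
        }
    }

    return $rows;
}

// Usage with a small, made-up result
$aggregations = [
    'by_day' => ['buckets' => [
        [
            'key' => 1328140800000,
            'doc_count' => 20,
            'aggregations' => [
                'by_gender' => ['buckets' => [
                    ['key' => 'male', 'doc_count' => 10, 'aggregations' => ['total' => ['value' => 13]]],
                    ['key' => 'female', 'doc_count' => 10, 'aggregations' => ['total' => ['value' => 2.18]]],
                ]],
            ],
        ],
    ]],
];

$rows = flattenAggregations($aggregations);
```

Each row repeats the parent bucket data, which is exactly the duplication listed as the first downside.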

webdevsHub avatar May 12 '15 10:05 webdevsHub

Yes, the real problem is the nested aggregations, not the array itself.

The ResultSet object does not have to contain the aggregations (and nested ones) as objects. The goal is to get an aggregation object on demand.

In my case, for performance and processing reasons, I created a Symfony service that lets me pass in first-level aggregation(s) and get every "level + 1" aggregation together with its direct parent. It automatically detects the aggregation type and stores the data the way I want. If we need "level + 2" aggregations, we simply give the "level + 1" aggregation as input.

I agree that it's basically the same, but, for instance, the top_hits aggregation is returned as an array in the aggregations, and it's really useful to have every document as a Result object. Moreover, it's cleaner than getting values by index.
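
To make the "level + 1" idea concrete, here is a minimal sketch using plain arrays. It is not the service itself: directSubAggregations is a hypothetical name, the sample data is invented, and the function assumes sub-aggregations sit directly inside each bucket next to 'key' and 'doc_count', which is what the loop over each bucket in the code further down expects.

```php
<?php

/**
 * Returns the direct ("level + 1") sub-aggregations of a bucket aggregation,
 * indexed by the key of the parent bucket. Hypothetical helper.
 */
function directSubAggregations(array $aggregation): array
{
    $children = [];
    foreach ($aggregation['buckets'] ?? [] as $bucket) {
        foreach ($bucket as $name => $value) {
            // Scalars such as 'key' and 'doc_count' describe the bucket itself
            if (is_array($value)) {
                $children[$bucket['key']][$name] = $value;
            }
        }
    }

    return $children;
}

// Made-up first-level aggregation with two nested levels
$byGender = [
    'buckets' => [
        [
            'key' => 'male',
            'doc_count' => 10,
            'by_grade' => [
                'buckets' => [
                    ['key' => 'a', 'doc_count' => 4, 'total' => ['value' => 13]],
                ],
            ],
        ],
    ],
];

// "level + 1": the children of the first-level aggregation
$levelOne = directSubAggregations($byGender);

// "level + 2": feed a "level + 1" aggregation back in
$levelTwo = directSubAggregations($levelOne['male']['by_grade']);
```

Because each call only looks one level down, nothing below the requested level is processed until it is actually asked for.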

Thanks for sharing your solution with us :D.

defgenx avatar May 12 '15 11:05 defgenx

Can you please give some code or a link to the code of that service? I did not quite follow that.

webdevsHub avatar May 12 '15 17:05 webdevsHub

Sorry if my explanations were unclear. This example is not a service (because of the constructor and all); I rewrote it for the occasion and it is not tested. As you can see, the approach is very close to your solution.

use Elastica\Result;

// The class name, properties and constant values are assumptions; the original
// post showed only the three methods.
class AggregationHelper
{
    const AGGREGATION_RESULT_KEY = 'key';
    const AGGREGATION_RESULT_KEY_BUCKETS = 'buckets';
    const AGGREGATION_RESULT_KEY_DOC_COUNT = 'doc_count';
    const AGGREGATION_RESULT_KEY_HITS = 'hits';
    const AGGREGATION_RESULT_KEY_SUM_DOC = 'sum_other_doc_count';
    const AGGREGATION_RESULT_KEY_VALUE = 'value';

    private $aggregation = array();
    private $subAggregation = array();
    private $key = array();
    private $doc_count = array();
    private $sum_other_doc_count = array();

    /**
     * @param array $aggregation A single aggregation, or a map of name => aggregation
     * @param array $options     Pass 'aggrName' to process one named aggregation only
     */
    public function __construct(array $aggregation, array $options = array())
    {
        if (isset($options['aggrName'])) {
            $this->setAggregation($aggregation, $options['aggrName']);
        } else {
            $this->setAggregations($aggregation);
        }
    }

    /**
     * @param array  $aggregation
     * @param string $aggrName
     */
    private function setAggregation(array $aggregation, $aggrName)
    {
        // First, we test whether the searched aggregation is a nested one
        // (we should add a method to check the validity of the aggregation name).
        // If it is, we have to switch into this aggregation.
        if (isset($aggregation[$aggrName])) {
            $aggregation = $aggregation[$aggrName];
        }

        $this->sum_other_doc_count[$aggrName] = isset($aggregation[self::AGGREGATION_RESULT_KEY_SUM_DOC])
            ? $aggregation[self::AGGREGATION_RESULT_KEY_SUM_DOC]
            : 0;

        // No buckets: store the aggregation as it is
        if (!isset($aggregation[self::AGGREGATION_RESULT_KEY_BUCKETS])) {
            $this->aggregation[$aggrName] = $aggregation;
            return;
        }

        foreach ($aggregation[self::AGGREGATION_RESULT_KEY_BUCKETS] as $aggrResult) {
            $bucketKey = $aggrResult[self::AGGREGATION_RESULT_KEY];
            $this->key[$aggrName][] = $bucketKey;
            $this->doc_count[$aggrName][] = $aggrResult[self::AGGREGATION_RESULT_KEY_DOC_COUNT];

            // Everything is indexed by the key of the parent bucket
            foreach ($aggrResult as $subKey => $subAggrResult) {
                if (!is_array($subAggrResult)) {
                    continue; // "key", "doc_count", ...
                }
                if (isset($subAggrResult[self::AGGREGATION_RESULT_KEY_HITS])) {
                    // top_hits aggregation: wrap every hit in a Result object
                    foreach ($subAggrResult[self::AGGREGATION_RESULT_KEY_HITS][self::AGGREGATION_RESULT_KEY_HITS] as $eltSubAggr) {
                        $this->subAggregation[$aggrName][$subKey][$bucketKey][] = new Result($eltSubAggr);
                    }
                } elseif (array_key_exists(self::AGGREGATION_RESULT_KEY_VALUE, $subAggrResult)) {
                    // Simple value aggregation (sum, avg, ...)
                    $this->subAggregation[$aggrName][$subKey][$bucketKey] = $subAggrResult[self::AGGREGATION_RESULT_KEY_VALUE];
                }
            }
        }

        $this->aggregation[$aggrName] = $aggregation[self::AGGREGATION_RESULT_KEY_BUCKETS];
    }

    /**
     * @param array $aggregations Map of aggregation name => aggregation
     */
    private function setAggregations(array $aggregations)
    {
        foreach ($aggregations as $aggregationName => $aggregation) {
            $this->setAggregation($aggregation, $aggregationName);
        }
    }
}

defgenx avatar May 13 '15 09:05 defgenx