Option to disable zero-filling of aggregateBy
**Problem**

The MapAggregator class provides zero-filled results. This may generate a lot of undesired "zero" data entries.

**Describe the solution you'd like**

Zero-filling should be optional, i.e. it should be possible to disable it via a flag, for example:

```java
mapAggregator.zerofill(false)
```
**Additional context**
When working with multiple aggregateBy steps and/or relatively sparse datasets (e.g. aggregation by user ids), it's not uncommon that most entries in the zero-filled result are zero. In such cases, these zero entries are typically not of interest and can, in the worst case, be a significant waste of CPU time and memory.
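A minimal sketch of how the proposed flag might be used in such a sparse scenario, assuming the usual OSHDB view/MapReducer chaining (the `zerofill` method itself is the hypothetical flag proposed above, not existing API):

```java
// Count contributions per timestamp and contributor user id. With sparse data,
// zero-filling would emit an entry for every (timestamp, userId) combination;
// the proposed flag would skip all combinations that never occur.
SortedMap<OSHDBCombinedIndex<OSHDBTimestamp, Integer>, Integer> counts =
    OSMContributionView.on(oshdb)
        .timestamps("2017-01-01", "2019-01-01", Interval.YEARLY)
        .aggregateByTimestamp()
        .aggregateBy(OSMContribution::getContributorUserId)
        .zerofill(false) // hypothetical flag: do not emit zero entries
        .count();
```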
What would be your expected result of the following?
```java
.timestamps("2017-01-01", "2019-01-01", Interval.YEARLY)
.aggregateByTimestamp()
.flatMap(snap -> {
  // emit one random Agg for every snapshot except the one at year2018
  if (!snap.getTimestamp().equals(year2018)) {
    return Collections.singleton(rnd.nextBoolean()
        ? new Agg("A", "Y")
        : new Agg("B", "X"));
  }
  return Collections.emptyList();
})
.aggregateBy(Agg::level1)
.aggregateBy(Agg::level2)
.count()
```
In the current version, the result would be:
```
2017-01-01&A&X -> 0
2017-01-01&A&Y -> 6
2017-01-01&B&X -> 7
2017-01-01&B&Y -> 0
2018-01-01&A&X -> 0
2018-01-01&A&Y -> 0
2018-01-01&B&X -> 0
2018-01-01&B&Y -> 0
2019-01-01&A&X -> 0
2019-01-01&A&Y -> 32
2019-01-01&B&X -> 24
2019-01-01&B&Y -> 0
```
Sorry, I don't understand the question.
IMO, this result is fine. If zero-filling were disabled (as per this feature request), I would expect only entries with values > 0 in the final result.
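For the example above, that would presumably leave only the four non-zero entries:

```
2017-01-01&A&Y -> 6
2017-01-01&B&X -> 7
2019-01-01&A&Y -> 32
2019-01-01&B&X -> 24
```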
There was a branch for this which unfortunately never made it into a PR, but the central idea could be picked up again:
https://github.com/GIScience/oshdb/compare/optional-zerofilling
Maybe the wording could be improved, because I'm not sure everyone understands what we mean by zero-fill. Maybe the term "sparse result" would be more intuitive? What do you think?
Btw, we would also need to specify what should happen if a user manually specifies (at least) one aggregateBy with explicit zero-fill keys and still requests non-zero-filled output, since the two contradict each other. As in `mapReducer.aggregateBy(Agg::level1, EnumSet.of(A, B)).zerofill(false).count()`: should the `zerofill(false)` take precedence over the manually specified zero-fill keys (A, B)? Alternatively, an exception could be thrown. I'd prefer the first solution at the moment.
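To illustrate the two options (a sketch; `Agg`, `A` and `B` are the example types from this thread, and `zerofill` is the proposed flag):

```java
// Contradictory request: explicit zero-fill keys A and B are given,
// but zero-filling is disabled at the same time.
mapReducer
    .aggregateBy(Agg::level1, EnumSet.of(A, B)) // manually specified zero-fill keys
    .zerofill(false)                            // ...but zero-filling disabled
    .count();
// Option 1 (preferred above): zerofill(false) takes precedence; A and B appear
// in the result only if they actually occur in the data.
// Option 2: fail fast, e.g. throw an IllegalStateException when the query is
// built, signalling the contradictory configuration.
```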
Great that you picked this up again! Yes, I like the wording of `mapAggregator.sparseResult(true)`, with `.sparseResult(false)` being the default (the current behaviour).
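Under that naming, the flag from the earlier sketches would read (again hypothetical API):

```java
mapAggregator
    .sparseResult(true) // omit zero entries; false (the default) would keep
                        // the current zero-filled behaviour
    .count();
```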
I tend more towards throwing an exception, but both solutions are fine.