Option to disable Zero-filling of aggregateBy

Open SlowMo24 opened this issue 6 years ago • 7 comments

Problem: The MapAggregator class provides zero-filled results. This may generate a lot of undesired "zero" data entries.

Describe the solution you'd like: Zero-filling should be optional, i.e. it should be possible to disable it with a flag. For example:

mapaggregator.zerofill(false)

Additional context: When working with multiple aggregateBy steps and/or relatively sparse datasets (e.g. aggregation by user ids), it's not uncommon that most entries in the zero-filled result are zero. In such cases, these zero entries are typically not of interest, and in the worst case they could be a significant waste of CPU time and memory.
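To make the cost concrete, here is a minimal sketch in plain Java collections. It is a toy model, not the OSHDB API: the class `ZeroFillSketch`, its methods, and the `timestamp&user` key format are made up for illustration. It counts observations per (timestamp, user) pair once sparsely and once with every combination pre-filled with 0.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: this is NOT the OSHDB API, just plain Java collections.
class ZeroFillSketch {
  /** Counts observed (timestamp, user) pairs; only combinations that occur get an entry. */
  static Map<String, Integer> sparseCounts(List<String[]> observations) {
    Map<String, Integer> result = new TreeMap<>();
    for (String[] obs : observations) {
      result.merge(obs[0] + "&" + obs[1], 1, Integer::sum);
    }
    return result;
  }

  /** Same counts, but every timestamp/user combination gets an entry, defaulting to 0. */
  static Map<String, Integer> zeroFilledCounts(
      List<String[]> observations, List<String> timestamps, List<String> users) {
    Map<String, Integer> result = new TreeMap<>();
    for (String ts : timestamps) {
      for (String user : users) {
        result.put(ts + "&" + user, 0);
      }
    }
    // Add the actual counts on top of the zero entries.
    sparseCounts(observations).forEach((key, count) -> result.merge(key, count, Integer::sum));
    return result;
  }
}
```

With three timestamps, three users and observations for only two combinations, the sparse map holds 2 entries while the zero-filled map holds 9, seven of which are zero. The zero-filled size grows with the product of the key sets, regardless of how much data there actually is.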

SlowMo24 avatar Jul 09 '19 12:07 SlowMo24

What would be your expected result of the following query?

.timestamps("2017-01-01", "2019-01-01", Interval.YEARLY)
.aggregateByTimestamp()
.flatMap(snap -> {
  if (!snap.getTimestamp().equals(year2018)) {
    return Collections.singleton(rnd.nextBoolean()
        ? new Agg("A", "Y")
        : new Agg("B", "X"));
  }
  return Collections.emptyList();
})
.aggregateBy(Agg::level1)
.aggregateBy(Agg::level2)
.count()

rtroilo avatar Feb 10 '20 09:02 rtroilo

in the current version the result would be:

2017-01-01&A&X->0
2017-01-01&A&Y->6
2017-01-01&B&X->7
2017-01-01&B&Y->0
2018-01-01&A&X->0
2018-01-01&A&Y->0
2018-01-01&B&X->0
2018-01-01&B&Y->0
2019-01-01&A&X->0
2019-01-01&A&Y->32
2019-01-01&B&X->24
2019-01-01&B&Y->0

rtroilo avatar Feb 10 '20 09:02 rtroilo

sorry, I don't understand the question

SlowMo24 avatar Feb 10 '20 11:02 SlowMo24

IMO, this result is fine. If zero-filling were disabled (as per this feature request), I would expect only entries with values > 0 in the final result.
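As an illustration of that expectation, the sketch below takes the twelve entries quoted earlier in this thread and drops those with a count of 0. The class `SparseResultSketch` and its methods are hypothetical helpers, not part of OSHDB. Filtering after the fact only shows what the output would look like. The requested feature would presumably avoid creating the zero entries in the first place.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of OSHDB.
class SparseResultSketch {
  /** Returns a copy of the result containing only entries with a count > 0, keeping the order. */
  static Map<String, Integer> dropZeros(Map<String, Integer> zeroFilled) {
    Map<String, Integer> sparse = new LinkedHashMap<>();
    zeroFilled.forEach((key, count) -> {
      if (count > 0) {
        sparse.put(key, count);
      }
    });
    return sparse;
  }

  /** The twelve zero-filled entries quoted in the thread. */
  static Map<String, Integer> exampleFromThread() {
    Map<String, Integer> r = new LinkedHashMap<>();
    r.put("2017-01-01&A&X", 0);
    r.put("2017-01-01&A&Y", 6);
    r.put("2017-01-01&B&X", 7);
    r.put("2017-01-01&B&Y", 0);
    r.put("2018-01-01&A&X", 0);
    r.put("2018-01-01&A&Y", 0);
    r.put("2018-01-01&B&X", 0);
    r.put("2018-01-01&B&Y", 0);
    r.put("2019-01-01&A&X", 0);
    r.put("2019-01-01&A&Y", 32);
    r.put("2019-01-01&B&X", 24);
    r.put("2019-01-01&B&Y", 0);
    return r;
  }
}
```

Applied to the example, `dropZeros` keeps four of the twelve entries: `2017-01-01&A&Y`, `2017-01-01&B&X`, `2019-01-01&A&Y` and `2019-01-01&B&X`. Note that 2018-01-01 disappears from the result entirely.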

tyrasd avatar Feb 11 '20 13:02 tyrasd

there was a branch for this, which unfortunately never made it into a PR, but the central idea could be picked up again:

https://github.com/GIScience/oshdb/compare/optional-zerofilling

tyrasd avatar Jul 23 '21 13:07 tyrasd

maybe the wording could be improved, because I'm not sure everyone understands what we mean by zerofill. Maybe a term like "sparse result" would be more intuitive? What do you think?

Btw, we would also need to specify what should happen if a user manually specifies zerofill keys for (at least) one aggregateBy and still requests non-zerofilled output, since the two contradict each other, as in mapReducer.aggregateBy(Agg::level1, EnumSet.of(A, B)).zerofill(false).count(). Should the zerofill(false) take precedence over the manually specified zerofill keys (A, B)? Alternatively, an exception could be thrown. I'd prefer the first solution at the moment.
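The two options can be sketched side by side. Everything in the following block is hypothetical (the class `ZerofillConflictSketch`, the `ConflictPolicy` enum and the method `keysToZeroFill`); it does not reflect any existing OSHDB code, and only models the decision of which keys end up zero-filled.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the two proposals; none of these names exist in OSHDB.
class ZerofillConflictSketch {
  enum ConflictPolicy { FLAG_TAKES_PRECEDENCE, THROW_EXCEPTION }

  /**
   * Decides which keys get zero-filled.
   *
   * @param explicitKeys keys manually passed to an aggregateBy step (may be empty)
   * @param zerofill value of the proposed zerofill(...) flag
   * @param policy how to resolve explicit keys combined with zerofill(false)
   */
  static Set<String> keysToZeroFill(
      Set<String> explicitKeys, boolean zerofill, ConflictPolicy policy) {
    if (zerofill) {
      // No conflict: the explicitly requested keys are zero-filled.
      return new TreeSet<>(explicitKeys);
    }
    if (!explicitKeys.isEmpty() && policy == ConflictPolicy.THROW_EXCEPTION) {
      throw new IllegalStateException(
          "zerofill(false) contradicts explicitly specified zerofill keys: " + explicitKeys);
    }
    // Either there are no explicit keys, or the flag silently wins.
    return Collections.emptySet();
  }
}
```

With `FLAG_TAKES_PRECEDENCE`, the keys A and B are silently ignored when zerofill is false. With `THROW_EXCEPTION`, the same call fails fast, which makes the contradiction visible to the caller at the cost of a stricter API.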

tyrasd avatar Jul 23 '21 13:07 tyrasd

great you picked this up again! Yes, I like the wording of mapaggregator.sparseResult(true) with .sparseResult(false) being the default (current procedure).

I lean more towards throwing an exception, but both solutions are fine.

SlowMo24 avatar Jul 26 '21 12:07 SlowMo24