Null propagation doesn't work between Java Stream operations

alxhill opened this issue on Dec 13 '22 • 7 comments

A common pattern in some codebases is to filter streams by the keys of a map:

  private Stream<String> filterBasedOnMapKeys(Map<String, String> map, Stream<String> stream) {
    return stream
            .filter(map::containsKey)
            .map(value -> someMethod(map.get(value)));
  }

This fails the null check, claiming that someMethod is being passed a nullable value where a non-null one is required. However, since the stream has already been filtered, the .get(...) call is safe.

alxhill • Dec 13 '22 17:12

I'd love to add support for this feature. Any guidance on whether it's feasible within the current code structure?

alxhill • Dec 13 '22 17:12

@alxhill this is an interesting example. We basically want to learn that, after the filter, all values in the resulting stream are contained in some map. I think we have the guts of the code to make this possible; from a call like filter(x -> x.getFoo() != null), we can already learn that the result of a v.getFoo() call inside the map callback is @NonNull, where v is the parameter of that callback. For this feature, we'd need to be able to learn from filter(x -> m.containsKey(x)) that m.get(v) is @NonNull in the map callback method.
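
For concreteness, here is a minimal sketch of that already-supported pattern; the Widget class and its @Nullable getFoo() method are made up purely for illustration (use whatever @Nullable annotation the project already recognizes):

  import java.util.stream.Stream;
  import javax.annotation.Nullable;

  class AlreadySupported {
    static class Widget {
      @Nullable String foo;

      @Nullable String getFoo() {
        return foo;
      }
    }

    Stream<String> trimmedFoos(Stream<Widget> stream) {
      return stream
          .filter(x -> x.getFoo() != null)
          // The stream handler re-roots the fact "getFoo() is non-null" from x
          // to v, so this dereference is not reported.
          .map(v -> v.getFoo().trim());
    }
  }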

I think @lazaroclapp can comment more on how exactly we might implement this support. Also, FWIW, I think initial support will be easier to implement if a lambda is used for filter rather than a method reference, though method references should be possible to handle as well.

msridhar • Dec 13 '22 18:12

So, the place to look at for stream handling is the StreamNullabilityPropagator handler.

The problem you are running into is that, fundamentally, the way stream propagation works for:

s.filter(a -> [Body A]).map(b -> [Body B])

is that it looks at all the access paths reachable from a and their known nullability whenever [Body A] would return true, uproots them, and then re-roots them at b while analyzing [Body B] (I believe we also handle method references here, but I'm eliding that complexity for now).

So, for example, the following code works like a charm:

  private Stream<String> filterBasedOnMapKeys2(Stream<Map<String, String>> stream, String key) {
    return stream.filter(m1 -> m1.containsKey(key)).map(m2 -> m2.get(key).toString());
  }

Here the stream handler sees that, for the lambda m1 -> m1.containsKey(key), the only return producing true happens when {m1[key] : NONNULL}, which is a fact over an access path rooted at m1, the filter lambda's argument. Then, when analyzing m2 -> m2.get(key).toString(), it adds the fact {m2[key] : NONNULL} on entry to the lambda, since it uproots all APs rooted at m1 in the previous lambda and re-roots them at m2. This guarantees that m2.get(key) is non-null, and thus there is no false positive.

Unfortunately, for:

  private Stream<String> filterBasedOnMapKeys3(Map<String, String> map, Stream<String> stream) {
    return stream.filter(k -> map.containsKey(k)).map(value -> map.get(value).toString());
  }

(Simplified from your example; the method reference might make things slightly more complex in the handler, but it is not the main issue.)

Well, here the fact {map[k] : NONNULL} is NOT about an access path rooted at the argument of the filter lambda (i.e., k), but one rooted at map, and our heuristic breaks entirely.

It might be possible to extend the stream handler to also propagate facts belonging to the shared closure of all lambdas in the stream, but I am not sure how easy or hard that will be to do robustly. We would also need to re-map the map[k] AP to map[value], somehow.

lazaroclapp • Dec 13 '22 21:12

It might be possible to extend the stream handler to also propagate facts belonging to the shared closure of all lambdas in the stream, but I am not sure how easy or hard that will be to do robustly. We would also need to re-map the map[k] AP to map[value], somehow.

@lazaroclapp conceptually, it seems simple to me to look for any access path in the store "mentioning" the parameter k of filter and then substitute value for k for the map callback. Particularly for a case like map[k], where map is an effectively-final local from the closure, this seems relatively safe. Is your concern at a conceptual level or about implementation?
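
To make that concrete, here is a toy sketch of the substitution. The Ap record, the string nullness values, and the substitute helper are invented for illustration and are far simpler than NullAway's real access-path representation:

  import java.util.ArrayList;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Map;

  final class SubstitutionSketch {
    // An access path such as map[k]: root variable "map", one map-get key "k".
    record Ap(String root, List<String> mapGetKeys) {}

    // Rewrite every fact that mentions the filter lambda's parameter (as the
    // root or as a map-get key) to use the map callback's parameter instead.
    static Map<Ap, String> substitute(
        Map<Ap, String> factsOnTrueReturn, String filterParam, String mapParam) {
      Map<Ap, String> mapLambdaEntryStore = new LinkedHashMap<>();
      factsOnTrueReturn.forEach((ap, nullness) -> {
        boolean mentionsParam = ap.root().equals(filterParam);
        String newRoot = mentionsParam ? mapParam : ap.root();
        List<String> newKeys = new ArrayList<>();
        for (String key : ap.mapGetKeys()) {
          if (key.equals(filterParam)) {
            mentionsParam = true;
            newKeys.add(mapParam);
          } else {
            newKeys.add(key);
          }
        }
        // Facts that never mention the filter parameter are not carried over,
        // mirroring the wording of the suggestion above.
        if (mentionsParam) {
          mapLambdaEntryStore.put(new Ap(newRoot, newKeys), nullness);
        }
      });
      return mapLambdaEntryStore;
    }
  }

For the failing example, this would turn {map[k] : NONNULL} into {map[value] : NONNULL} on entry to the map callback, and it also covers the already-working case by re-rooting {m1[key] : NONNULL} to {m2[key] : NONNULL}.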

msridhar • Dec 13 '22 23:12

It might be possible to extend the stream handler to also propagate facts belonging to the shared closure of all lambdas in the stream, but I am not sure how easy or hard that will be to do robustly. We would also need to re-map the map[k] AP to map[value], somehow.

@lazaroclapp conceptually, it seems simple to me to look for any access path in the store "mentioning" the parameter k in filter and then substituting value for k for the map callback. Particularly for a case like map[k] where map is an effectively-final local from the closure, this seems relatively safe. Is your concern at a conceptual level or about implementation?

The main change, and I think it is significant, is that now we need to track access paths that are part of the shared closure here and we need to propagate their nullability inter-procedurally, which we weren't doing before. The stream handler is very specifically reasoning about "paths rooted at the argument of the filter that are re-rooted at the argument of the following stream operations", it is not taking the full store and re-mapping stuff.

For example, this will also result in an error:

    final String key = "foo"; // an effectively-final local captured by both lambdas
    return stream.filter(k -> map.containsKey(key)).map(value -> map.get(key).toString());

Even though it is map.containsKey(key) and map.get(key), with no renaming needed, the error occurs because the analysis is dealing with two separate procedures with their own nullability stores, and they don't fall under the very specific case that the stream handler uses to initialize the map lambda's nullability store.

So this does change somewhat the scope of what StreamNullabilityPropagator is transferring between the different stream operations.

Plus, now we need to worry about any stream operators that introduce delays or affect threading.

This is safe (modulo aliasing):

  private Observable<String> foo(Observable<Map<String, String>> obs, String key) {
    return obs.filter(m1 -> m1.containsKey(key))
                      .observeOn(...)
                      .delay(...)
                      .map(m2 -> m2.get(key).toString());
  }

This isn't:

  private Observable<String> filterBasedOnMapKeys(Map<String, String> map, Observable<String> stream) {
    final String key = "foo"; // an effectively-final local captured by the lambdas below
    Observable<String> result = stream.filter(k -> map.containsKey(key))
                      .observeOn(...)
                      .delay(...)
                      .map(value -> map.get(key).toString());
    map.remove(key); // potentially concurrent with delay()
    return result;
  }

We have guarantees about the values flowing through the stream that aren't there for random variables the lambdas in the stream just happen to close over...

p.s. Not saying there can't or shouldn't be a way to support the original idiom mentioned in this issue, it's just that I don't think it's quite a trivial extension of our current stream handling support.

lazaroclapp • Dec 14 '22 00:12

The stream handler is very specifically reasoning about "paths rooted at the argument of the filter that are re-rooted at the argument of the following stream operations", it is not taking the full store and re-mapping stuff.

I guess I was thinking about a generalization to "paths involving the argument of the filter that are re-written to use the argument of the subsequent stream operations." This assumes such paths are already tracked in the store when analyzing the filter callback. If we don't do that already, adding that would potentially be pretty intrusive.

Regarding calls to map.remove(), we are already unsound for such calls even ignoring streams :-) It is indeed possible, however, that adding to such unsoundness with stream handling could lead to real-world false negatives.

In any case, the bottom line, @alxhill, is that it may not be straightforward to extend the current code to handle cases like yours. As a (somewhat unsatisfying) workaround for now, you can add a downcast around the get() calls.
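
For concreteness, here is a sketch of that workaround applied to the original example, using Objects.requireNonNull as the downcast (a project-configured castToNonNull method would work the same way); the someMethod stub below just stands in for the one in the original snippet:

  import java.util.Map;
  import java.util.Objects;
  import java.util.stream.Stream;

  class Workaround {
    private Stream<String> filterBasedOnMapKeys(Map<String, String> map, Stream<String> stream) {
      return stream
              .filter(map::containsKey)
              // The downcast: requireNonNull's result is treated as non-null, so the
              // false positive goes away, at the cost of a redundant runtime check.
              .map(value -> someMethod(Objects.requireNonNull(map.get(value))));
    }

    private String someMethod(String s) {
      return s;
    }
  }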

msridhar • Dec 14 '22 05:12

Regarding calls to map.remove(), we are already unsound for such calls even ignoring streams :-)

Fair. I don't quite remember the semantics of remove() on existing APs involving that map and key. I guess the general point was: for paths rooted at elements in the stream, we don't care what may be happening in parallel with the stream; for paths rooted elsewhere but involving stream elements... I am not so sure.

We could always try it; we are unsound in worse ways in some cases (e.g., general aliasing), but it is a change that needs some thought and some testing. Happy either to accept a PR here or to try to do it when we can, but I do feel a few iterations will be required to get this right.

Conservatively, to get this to work we would have two kinds of "passthrough" stream methods: one being what we currently consider passthrough (including delay, observeOn, etc.), and a more restricted set where we expect context changes outside the stream to be, well, not impossible, but a lot less likely. We could then allow APs rooted at stream elements to flow through both kinds, while other APs mentioning stream elements flow only through the second kind. Or is this overly convoluted? (It's not like this approach is sound, per se, since technically the scheduler can switch threads at any point, without explicit wait or threading operations, but the risk seems higher when explicit delays or thread-switching are introduced.)

Another problem is that the stream might already be running on another thread by default, relative to the method in which map is defined, and there is no local way of knowing that...
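
To sketch the idea (the names below are invented for illustration and are not part of NullAway's actual configuration):

  // Two tiers of "passthrough" stream methods, as described above.
  enum PassthroughKind {
    // Roughly today's passthrough set (delay, observeOn, etc.): only access
    // paths rooted at the stream element itself would flow through these.
    ELEMENT_ROOTED_ONLY,
    // A more restricted set (e.g., purely synchronous operators) through which
    // access paths that merely mention the stream element, like map[k],
    // could also flow.
    ALLOWS_CLOSURE_FACTS
  }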

lazaroclapp • Dec 14 '22 18:12