
Clarify documentation w.r.t. partitioning with Kafka Binder

Open filpano opened this issue 3 years ago • 0 comments

Describe the issue

When no partitionSelectorExpression is configured to determine the partitioning strategy explicitly, the org.springframework.cloud.stream.binder.PartitionHandler.DefaultPartitionSelector is used, which essentially computes hashCode() % numPartitions to pick the partition to produce to. This is (AFAIK) intentionally technology-agnostic, so that data sinks without native partitioning can be supported as well.

The current documentation also mentions this:

The preceding configuration uses the default partitioning (key.hashCode() % partitionCount). This may or may not provide a suitably balanced algorithm, depending on the key values. You can override this default by using the partitionSelectorExpression or partitionSelectorClass properties.

IMO, it would be extremely helpful for newcomers to include a warning (or extend the existing info blurb) in this spot that explicitly mentions that this is not how the default Kafka producer partitioning works (which applies the murmur2 hash function to the serialized key). Most notably, this means that two messages with the same key value, one produced by SCS and one produced by a standard Java Kafka Producer or e.g. Kafka Streams, are not co-partitioned out of the box. Depending on the assumptions made up to that point, this can be quite a painful realization.
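To make the difference concrete, here is a minimal, self-contained sketch that compares the two strategies for a few String keys. It is not the binder's or the Kafka client's actual code: the class and method names (PartitionComparison, hashCodePartition, kafkaStylePartition) are made up for illustration, the non-negative guard in hashCodePartition is an assumption, and the murmur2 routine is a re-implementation written to follow the algorithm used by the Kafka Java client (seed 0x9747b28c, result masked to a positive int, keys assumed to be serialized as UTF-8 strings).

```java
import java.nio.charset.StandardCharsets;

final class PartitionComparison {

    // Sketch of the documented binder default: key.hashCode() % partitionCount.
    // Assumption: the hash is made non-negative first so the result is a valid index.
    static int hashCodePartition(Object key, int partitionCount) {
        int hash = key.hashCode();
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % partitionCount;
    }

    // 32-bit murmur2 over a byte array, written to mirror the Kafka client's variant.
    static int murmur2(byte[] data) {
        final int length = data.length;
        final int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;

        int h = seed ^ length;
        final int length4 = length / 4;

        // Mix four bytes at a time (little-endian) into the hash.
        for (int i = 0; i < length4; i++) {
            final int i4 = i * 4;
            int k = (data[i4] & 0xff)
                    + ((data[i4 + 1] & 0xff) << 8)
                    + ((data[i4 + 2] & 0xff) << 16)
                    + ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }

        // Handle the remaining 1 to 3 bytes; the fall-through is intentional.
        final int tail = length & ~3;
        switch (length % 4) {
            case 3:
                h ^= (data[tail + 2] & 0xff) << 16;
            case 2:
                h ^= (data[tail + 1] & 0xff) << 8;
            case 1:
                h ^= data[tail] & 0xff;
                h *= m;
        }

        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    // Sketch of the keyed-record path of the default Kafka producer partitioner:
    // positive murmur2 of the serialized key, modulo the partition count.
    static int kafkaStylePartition(String key, int partitionCount) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8); // assumes a UTF-8 String serializer
        return (murmur2(keyBytes) & 0x7fffffff) % partitionCount;
    }

    public static void main(String[] args) {
        int partitionCount = 6; // hypothetical topic size
        for (String key : new String[] {"order-1", "order-2", "customer-42"}) {
            System.out.printf("%-12s hashCode -> %d, murmur2 -> %d%n",
                    key,
                    hashCodePartition(key, partitionCount),
                    kafkaStylePartition(key, partitionCount));
        }
    }
}
```

The two functions hash different inputs (the Java object's hashCode() versus the serialized key bytes) with different algorithms, so nothing ties their results together: for any given key they may or may not land on the same partition, which is exactly why co-partitioning cannot be relied on without overriding one side.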

Version of the framework

3.4.2 / Current (documentation only).

Expected behavior

My suggestion would be extending the existing info section (emphasis mine):

The preceding configuration uses the default partitioning (key.hashCode() % partitionCount). This may or may not provide a suitably balanced algorithm, depending on the key values. Note that this partitioning differs from the default used by a standalone Kafka producer, such as the one used by Kafka Streams, meaning that the same key value may balance differently when produced by those clients. You can override this default by using the partitionSelectorExpression or partitionSelectorClass properties.

I've opted to retain the phrasing of "balance differently" rather than "be distributed differently across partitions" so that it flows better with the previous sentence. I've also opted to exclude the explicit mention of co-partitioning so as to not bring in even more themes, but I'm open to feedback.

Additional context

For more context on this question, see the following SO question and the comments on the answer from @garyrussell.

I wasn't quite sure whether the SCS team would want external PRs for something as minute as this, but I'd be glad to submit one if the above makes sense and we can agree on a fitting phrasing.

filpano, Aug 02 '22 16:08