p4runtime Add flag to MulticastGroupEntry to prune replicas for the ingress port

In some cases, a controller might want to create multicast groups that include the ingress port of a packet, while still having a way to instruct the PRE (packet replication engine) to avoid producing replicas directed to that ingress port. A use case for this feature is L2 broadcast groups for ARP requests, where the request is broadcast to all ports associated with the same VLAN identifier, except for the ingress port.

Alternative approaches to produce an equivalent forwarding behavior are:

1. Drop the packet in the egress pipeline.
2. Create many multicast groups, each one excluding a different port, and provide match-action machinery to use one of such groups based on the packet ingress port.

In the first case, for some targets, this approach will consume unnecessary buffer and bandwidth resources for the outgoing direction on the ingress port. In the second case, it's an inefficient use of PRE resources, since we are creating many groups while we could have only one.
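To make the cost of the second alternative concrete, here is a minimal Python sketch. The port numbers and the helper name `per_ingress_groups` are illustrative assumptions, not part of P4Runtime or PSA; the sketch only counts how many groups and replica entries a controller would have to program.

```python
def per_ingress_groups(vlan_ports):
    """Alternative 2: build one multicast group per possible ingress port.

    Returns a dict mapping ingress port -> list of egress ports, i.e. the
    full VLAN membership minus that ingress port.
    """
    return {ingress: [p for p in vlan_ports if p != ingress]
            for ingress in vlan_ports}


vlan_ports = [1, 2, 3, 4]  # hypothetical ports in one VLAN
groups = per_ingress_groups(vlan_ports)

# N ports need N groups and N * (N - 1) replica entries in the PRE,
# versus 1 group and N replica entries if the PRE pruned the ingress port.
num_groups = len(groups)
num_replicas = sum(len(members) for members in groups.values())
print(num_groups, num_replicas)  # 4 12
```

With a single group plus pruning in the PRE, the same VLAN would need 1 group and 4 replica entries.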

The traffic manager of some PSA targets can already avoid producing replicas for the ingress port, but this capability is not exposed by P4Runtime.

My proposal is to extend the MulticastGroupEntry message with a new flag to instruct the PRE as described above. For example:

message MulticastGroupEntry {
  uint32 multicast_group_id = 1;
  repeated Replica replicas = 2;
  // If true and replicas contain the ingress port, instruct the
  // PRE to skip that replica. Default is false.
  bool with_ingress_port_pruning = 3;
}
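The intended semantics of the flag can be sketched with a small Python model of the PRE. This is not the P4Runtime client API: the dataclasses simply mirror the proto fields above, `replicate` is a hypothetical helper, and the sketch assumes that every replica on the ingress port is skipped regardless of its instance value.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    egress_port: int
    instance: int


@dataclass
class MulticastGroupEntry:
    multicast_group_id: int
    replicas: List[Replica] = field(default_factory=list)
    with_ingress_port_pruning: bool = False  # proposed flag, default false


def replicate(entry, ingress_port):
    """Return the (egress_port, instance) copies a PRE model would emit."""
    copies = []
    for r in entry.replicas:
        # Assumption: all replicas on the ingress port are skipped,
        # whatever their instance value is.
        if entry.with_ingress_port_pruning and r.egress_port == ingress_port:
            continue
        copies.append((r.egress_port, r.instance))
    return copies


group = MulticastGroupEntry(
    multicast_group_id=1,
    replicas=[Replica(1, 0), Replica(2, 0), Replica(3, 0)],
    with_ingress_port_pruning=True,
)
print(replicate(group, ingress_port=2))  # [(1, 0), (3, 0)]
```

With the flag left at its default of false, the same call would return all three replicas, including the one for port 2.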

Any thoughts? If there are no objections I can open a pull request with the proto change above.

Nov 01 '18 00:11 ccascone

Just some thoughts, but sorry if they are incoherent or don't provide a clear direction here.

This gets into what it means to be PSA compliant, or not, I suppose.

Suppose a device's manufacturer would like to call the device PSA compliant, but it can only do 1. or 2., not the new 3. you are proposing.

Should it be allowed to return an "unsupported" error if a controller attempts to configure a group with 'with_ingress_port_pruning' equal to true?

That is the main question I have, and I don't have a great answer, because if it is allowed to return an unsupported error, then either the controller must say "this controller software only works with PSA devices that do support optional feature X", or you need to automatically fall back to one of the other behaviors. Which of those two alternatives do you want your controller software to do?

A more nit-picky detailed question about the precise behavior of 'with_ingress_port_pruning = true': PSA explicitly requires that a multicast group allow multiple copies to be sent to the same output port, with different values of the instance field. Should all of those copies be pruned? If you want to support multicast replication into tunnel interfaces, or VLAN-based SVI interfaces, then you need to be able to prune based on the notion of a 'logical interface id', not a physical port number. PSA doesn't require support for anything but physical port IDs yet.

Anyway, food for thought.

Nov 01 '18 20:11 jafingerhut

Note: I do believe that it should be the goal of PSA to "raise the bar" on requirements over time, e.g. device X might be compliant with PSA v1.0, but it fails to be PSA v2.3 compliant, because v2.3 has requirements A, B, and C that X cannot do.

So part of my comments are: Are we saying that we should say that this feature is a requirement to be PSA v1.1 compliant, but not PSA v1.0 compliant?

Nov 01 '18 20:11 jafingerhut

FYI, most switch ASICs I know of (not P4-programmable ones) implement 1. The buffer space and egress bandwidth only become an issue if there are high rates of L2/L3 multicast, and in most deployment scenarios I know of, there isn't.

Nov 02 '18 05:11 jafingerhut

@jafingerhut these are excellent points.

Regarding PSA "compliance," I'm not sure what the best way to make requirements explicit is: via version numbers, or via an out-of-band mechanism to exchange capabilities. The latter would allow a controller to handle the different cases at runtime, for example by implementing pruning using 1 instead of 3 if pruning in the PRE is not supported. A controller could similarly handle the case of a target returning UNSUPPORTED when writing a multicast group entry with with_ingress_port_pruning = true. I'm not sure which approach is best here.
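A rough Python sketch of the runtime fallback idea follows. Everything in it is hypothetical: `UnsupportedError` stands in for whatever error a target would return, and `write_group` and `install_egress_drop` are placeholder callables supplied by the controller, not P4Runtime client functions.

```python
class UnsupportedError(Exception):
    """Stand-in for a target rejecting a write as unsupported."""


def install_group(write_group, install_egress_drop, group_id, ports):
    """Try pruning in the PRE first; fall back to dropping in egress.

    write_group and install_egress_drop are hypothetical callables provided
    by the controller. Returns which mechanism ended up being used.
    """
    try:
        write_group(group_id, ports, with_ingress_port_pruning=True)
        return "pre_pruning"
    except UnsupportedError:
        # The target cannot prune in the PRE: install the plain group and
        # rely on an egress rule that drops copies sent to the ingress port.
        write_group(group_id, ports, with_ingress_port_pruning=False)
        install_egress_drop(group_id)
        return "egress_drop"


def legacy_write(group_id, ports, with_ingress_port_pruning):
    """Fake target that does not support the proposed flag."""
    if with_ingress_port_pruning:
        raise UnsupportedError()


drop_rules = []
mode = install_group(legacy_write, drop_rules.append, 1, [1, 2, 3])
print(mode)  # egress_drop
```

The design question raised above remains: the fallback keeps the controller portable, but it also hides from the operator which mechanism a given device is using.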

On the other hand, I did a bit of research and it seems the P4 target I'm using supports PRE pruning capabilities more general than ingress port filtering. This mechanism requires some additional metadata to be set in the ingress pipeline to tell the PRE which replica(s) to skip. That can be used to implement filtering based on arbitrary identifiers, e.g., physical or logical port IDs. However, to support such a general pruning capability, I think we would need the PSA spec to formalize this metadata field before thinking about runtime control.

You are right about 1 being an issue only with high rates of L2/L3 multicast. Also, in the use cases we are working on at ONF, the bandwidth consumed by dropped replicas is negligible and does not affect other traffic. I'm ok if we want to postpone this discussion or even drop this proposal, as I don't have a more critical use case in mind at the moment. However, I found that there are cases for PRE pruning beyond ingress port filtering, such as PIM bidirectional DF check, MLAG pruning, VPLS split horizon check and probably others. I haven't tried implementing these behaviors in P4, but I have the feeling these cases would need the more general pruning capability described above.

In conclusion: I don't feel particularly strongly about this proposal specific to ingress port pruning. If others see value in it, I'm happy to work on a PR. I think the more general pruning capability would be more useful, but I'm not sure it is common enough across targets to be specified in PSA.

Nov 02 '18 23:11 ccascone

I don't know why I didn't remember it before your latest comment, but there was a time when I was thinking that for things like DF check, MLAG pruning, split horizon checks, etc., it might be nice if there was a packet replication engine that was P4-programmable. It could have P4-programmable table entries with user-defined metadata in addition to each (egress port, instance) pair that is supported by PSA today, and the P4 program would take that and whatever fields it wanted from the packet and decide 'yes, this should be a copy, with this modified metadata', or 'no, don't make this copy, move on to the next element of the replication list'.

Such a thing is easy to imagine. The hard problem is not "could it be done?" but "what fraction of ASIC vendors would do it, and with what limitations?" Most of what is in PSA is at the boundary of what is P4 programmable, and what is currently not, and moving that boundary will almost always leave behind some ASICs. FPGA, NPU, and pure software implementations will be able to keep up with such additions relatively easily, but they are not the devices with the lowest cost per terabit/second.

As I said before, I do think there should be future releases of PSA that "raise the bar" on requirements, and also in some cases specify how new features work, but make them optional, not required.

Sorry for rambling, but if we could come up with a more general strategy for creating optional add-on functionality, I would feel much more comfortable saying "Yeah, go specify a new optional PSA feature for that. Here are the guidelines and procedures to do it."

Nov 03 '18 01:11 jafingerhut

I had another thought that might go nowhere, but see what you think of it.

Suppose we say that PSA has a special name of egress control, say "EgressMulticastFiltering", which takes the parsed packet, including user-defined metadata, and the egress_port and instance metadata fields from the PRE as inputs, and it is allowed to do whatever it wants to determine whether to drop or not drop the packet.

If a PSA program calls that control first in the egress control, and if the target-specific compiler determines that the behavior of that control is "simple enough" for its target to handle in its just-before-storing-copies-in-the-packet-buffer packet replication engine logic, then that compiler should try to implement that drop logic in its packet replication engine rather than in egress.

If the target-specific compiler determines the behavior is too complex to do in its packet replication engine logic, it is allowed to implement it at the beginning of egress, where it appears in the P4 code.

We could even write up some example "drop decision" logic for the cases you describe, and I think most or all of them have these properties:

- the drop/keep decision is independent of which other entries are in the multicast (egress_port, instance) replication set, i.e. each decision can be made independently, with no history needed to be kept from one potential copy to the next.
- the drop/keep decision can be made solely based on a few user-defined metadata fields determined in the developer's ingress P4 code, plus (egress_port, instance), and a handful of other user-defined fields that could be looked up in a table with a key consisting solely of (mcast_grp, egress_port, instance).

With those kinds of restrictions, it seems to me a much more straightforward problem for a compiler to figure out whether the drop/keep decision would fit in its packet replication engine logic, or should be delayed until egress. A target-specific compiler would be encouraged to tell the developer what choice was made when it compiled the program, and ideally clues on why it didn't go into the packet replication engine, if it didn't.
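As a rough illustration of decision logic with those two properties, here is a Python model rather than P4 code. The metadata field names, the `split_horizon_group` attribute and the table contents are all made up for the example; the point is only that `keep_copy` is a pure function of ingress metadata, (mcast_grp, egress_port, instance), and a lookup keyed by that triple.

```python
def keep_copy(ingress_meta, mcast_grp, egress_port, instance, replica_table):
    """Decide drop/keep for one potential copy, with no state across copies.

    ingress_meta: user-defined metadata set in ingress (hypothetical fields).
    replica_table: lookup keyed solely by (mcast_grp, egress_port, instance).
    """
    attrs = replica_table.get((mcast_grp, egress_port, instance), {})
    # Ingress port pruning.
    if egress_port == ingress_meta["ingress_port"]:
        return False
    # Split-horizon style check: never forward between members of the same
    # horizon group (group 0 means "no group" in this sketch).
    group = attrs.get("split_horizon_group", 0)
    if group != 0 and group == ingress_meta["split_horizon_group"]:
        return False
    return True


replica_table = {
    (10, 1, 0): {"split_horizon_group": 7},
    (10, 2, 0): {"split_horizon_group": 7},
    (10, 3, 0): {"split_horizon_group": 0},
}
meta = {"ingress_port": 1, "split_horizon_group": 7}
kept = [key for key in replica_table if keep_copy(meta, *key, replica_table)]
print(kept)  # [(10, 3, 0)]
```

Because each call looks at a single potential copy, the same function could in principle be evaluated either inside the replication engine or at the start of egress without changing the result.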

One nice thing is that the controller's view of configuring any tables involved could be, I think, identical, regardless of which choice the compiler made.

Thoughts?

Nov 04 '18 03:11 jafingerhut

Postponed to after v1.0.0 release, as per discussion at the 11/07/2018 P4 API WG meeting

Nov 08 '18 23:11 antoninbas