address: lazily create string representation of an IP address
Commit Message: The current Address implementation creates the string representation of an IP address upon construction. This is probably okay for TCP, but for each UDP packet it's a lot of CPU cost. This change lazily creates the string representation of an IP address if the instance is created by Address::InstanceFactory.
Additional Description: Address instances created without the factory keep the old behavior, for lack of an efficient way to runtime-guard the new behavior.
Running `bazel run test/common/network:address_impl_speed_test_benchmark_test -c opt` shows a significant difference with and without lazy creation of the string form.
Lazy creation
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               981570 ns    981351 ns             1
ipv4InstanceCreateViaFactory     619560 ns    619751 ns             1
ipv6InstanceCreate              2475770 ns   2476094 ns             1
ipv6InstanceCreateViaFactory     510919 ns    510851 ns             1
```
No lazy creation
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               866841 ns    866243 ns             1
ipv4InstanceCreateViaFactory    1048951 ns   1049241 ns             1
ipv6InstanceCreate              2424031 ns   2424304 ns             1
ipv6InstanceCreateViaFactory    2900400 ns   2900675 ns             1
```
Before this change
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               650960 ns    650641 ns             1
ipv4InstanceCreateViaFactory     863200 ns    863162 ns             1
ipv6InstanceCreate              2144770 ns   2144933 ns             1
ipv6InstanceCreateViaFactory    2512201 ns   2512295 ns             1
```
Risk Level: low
Testing: added new unit test and benchmark tests
Docs Changes: N/A
Release Notes: yes
Platform Specific Features: N/A
Runtime guard: envoy.restart_features.address_factory_lazy_create_address_string
CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).
Assigning Ryan for a first-pass. /assign @RyanTheOptimist
Assigning !Googler reviewer /assign @ggreenway
/retest
/retest
LGTM. Are the benchmarks in the PR description accurate with the new implementation?
Updated
Hm. The new benchmarks suggest that the Create() methods are now slower but the CreateViaFactory() methods are faster. Is this really a win? Is one of those methods more popular than the other in the codebase?
Sorry, I didn't run the no-lazy-creation benchmark; updated it as well. Create() shouldn't show any difference with the flag on or off, as the change only takes effect in CreateViaFactory(). But all creation now does the extra work of initializing the std::once_flag, so compared to the old implementation Create() is slightly more expensive, though within the same magnitude. The numbers also differ slightly between runs.
| Benchmark | Base | Lazy | Delta | % Change | No Lazy | Delta | % Change |
|---|---|---|---|---|---|---|---|
| ipv4InstanceCreate | 650641 | 981351 | 330710 | 50.83% | 866243 | 215602 | 33.14% |
| ipv4InstanceCreateViaFactory | 863162 | 619751 | -243411 | -28.20% | 1049241 | 186079 | 21.56% |
| ipv6InstanceCreate | 2144933 | 2476094 | 331161 | 15.44% | 2424304 | 279371 | 13.02% |
| ipv6InstanceCreateViaFactory | 2512295 | 510851 | -2001444 | -79.67% | 2900675 | 388380 | 15.46% |
I dumped the data into a spreadsheet so I could compare them quantitatively. It looks like this is a 15%-30% regression in all cases when the flag is false ("No Lazy"), and a 15-50% regression in the Create() method when the flag is true. These both seem problematic?
The benchmarks do suggest a 30-80%(!!!!) improvement in the ViaFactory() methods, so that's great.
Am I interpreting the data correctly? If so, I'm not sure about how to balance the regression in the Create() (and flag-false) cases with the improvement in the ViaFactory() method.
There shouldn't be any difference in Create() between the lazy and no-lazy runs: both instantiate the std::once_flag and convert the socket address into a string. The regression difference between the two (50% vs. 33%) may just be noise between two runs of the benchmark. But the last column does show a 15-30% regression in Create() with the new implementation, even without lazy string conversion, due to the std::once_flag.
It could be that some of the degradation vs. the base case is because of the increased size of the objects due to adding the once_flag.
Yeah, that sounds possible. @danzh2010 can you run the test with more iterations to see if the numbers are consistent?
Old code
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               656954 ns    656911 ns         10000
ipv4InstanceCreateViaFactory     868333 ns    868294 ns         10000
ipv6InstanceCreate              2213465 ns   2213259 ns         10000
ipv6InstanceCreateViaFactory    2580065 ns   2579853 ns         10000
```
No Lazy
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               840328 ns    840277 ns         10000
ipv4InstanceCreateViaFactory    1007456 ns   1007379 ns         10000
ipv6InstanceCreate              2387824 ns   2387553 ns         10000
ipv6InstanceCreateViaFactory    2867409 ns   2867041 ns         10000
```
Lazy
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               849292 ns    849270 ns         10000
ipv4InstanceCreateViaFactory     499743 ns    499724 ns         10000
ipv6InstanceCreate              2424963 ns   2424739 ns         10000
ipv6InstanceCreateViaFactory     498451 ns    498441 ns         10000
```
The Create() regression is 30% for v4 and 10% for v6. v6 address uses inet_ntop() which is more expensive.
I would say if 30% regression causes noticeable performance degradation, the call site should switch to use the Address factory which can then lead to 25% improvement.
Do we have an understanding of where those call sites are? I'm also concerned that this PR is a regression when the runtime flag is off. I wonder if a better approach would be to remove the addressAsString() method and replace it with a free function that does the stringification on the fly. This method is used in a lot of places, so it would be a big change, though it might not be too hard to search and replace?
I'm just feeling uncomfortable about landing a change which regresses the flag-off performance and regresses one API while improving the performance of another API without having an understanding of the impact this has on Envoy in general.
Why would removing the addressAsString() method and replacing it with a free function be helpful? If the goal is stringification on the fly, we can change the implementation of addressAsString() to do that.
And because there are so many different call sites, it's really hard to audit every one of them to make sure stringification on the fly won't cause a performance regression for them, let alone extensions out there.
I'm just feeling uncomfortable about landing a change which regresses the flag-off performance and regresses one API while improving the performance of another API without having an understanding of the impact this has on Envoy in general.
Would it be okay if I make these changes in a different Address::Instance impl and only use it for UDP read?
One thing to consider is that in the UDP path (which we're trying to optimize here), we create A LOT more addresses than in TCP. My guess is for a typical http/2 request we create 1-2 addresses total. And it looks like the total time taken to do that is very small, especially relative to overall processing of an HTTP request.
Creating a new implementation of the existing Address interfaces for UDP only is an interesting idea to allow us to isolate the change to only certain parts of the code. If we're doing that, I'd recommend making the new interface always generate the string on the fly and not need any storage for it; that'll reduce the size of the object.
These sound like great ideas! I'd definitely be curious about the perf impact on Envoy that we see from making addressToString() do the stringification on the fly on every call. That will for sure be a win if we never call this method, which seems like the common case. Do we have load tests which could tell us this?
A UDP specific (or really a non-caching) subclass seems like also a worthy idea to explore!
I think we probably want to avoid requiring synchronization here, if we can avoid it.
I also wonder if there is anything we could do in the UDP code to create fewer address objects. What if we had, say, a map from sockaddr_in to IpAddress for recent addresses. Then we could just return the same object? Maybe this would be hard.
I had the same thought, but it looks like if you remove the friendly string, the IP instances are just a struct sockaddr that can be memcpy'd, so it may not save anything to cache them. But it might be worth trying/investigating. The number of unique addresses we see should be in the range of 1-2 per connection usually. We might get a big win by just keeping the most recent addresses on a connection, and comparing to those on the incoming packet, and reusing if they match. Basically an LRU cache of size 1.
Another thought is to instantiate the factory-created Address instance with an implementation that does lazy stringification.
The address LRU cache will be much larger on listener side though.
Oh right. I was thinking of on the connection, but we need to have already created the address to match a packet to a connection, so that doesn't work.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!