address: lazily create string representation of an IP address
Commit Message: The current Address implementation creates the string representation of an IP address upon construction. This is probably okay for TCP, but for each UDP packet it's a lot of CPU cost. This change lazily creates the string representation of an IP address if the instance is created by Address::InstanceFactory.
Additional Description: Address instances created without the factory keep the old behavior, for lack of an efficient way to runtime-guard the new behavior.
Running `bazel run test/common/network:address_impl_speed_test_benchmark_test -c opt` shows a significant difference with and without lazy creation of the string form.
Lazy creation
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               981570 ns    981351 ns             1
ipv4InstanceCreateViaFactory     619560 ns    619751 ns             1
ipv6InstanceCreate              2475770 ns   2476094 ns             1
ipv6InstanceCreateViaFactory     510919 ns    510851 ns             1
```
No lazy creation
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               866841 ns    866243 ns             1
ipv4InstanceCreateViaFactory    1048951 ns   1049241 ns             1
ipv6InstanceCreate              2424031 ns   2424304 ns             1
ipv6InstanceCreateViaFactory    2900400 ns   2900675 ns             1
```
Before this change
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               650960 ns    650641 ns             1
ipv4InstanceCreateViaFactory     863200 ns    863162 ns             1
ipv6InstanceCreate              2144770 ns   2144933 ns             1
ipv6InstanceCreateViaFactory    2512201 ns   2512295 ns             1
```
Risk Level: low
Testing: added new unit test and benchmark tests
Docs Changes: N/A
Release Notes: yes
Platform Specific Features: N/A
Runtime guard: envoy.restart_features.address_factory_lazy_create_address_string
CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).
Assigning Ryan for a first-pass. /assign @RyanTheOptimist
Assigning !Googler reviewer /assign @ggreenway
/retest
/retest
LGTM. Are the benchmarks in the PR description accurate with the new implementation?
Updated
Hm. The new benchmarks suggest that the Create() methods are now slower but the CreateViaFactory() methods are faster. Is this really a win? Is one of those methods more popular than the other in the codebase?
Sorry, I didn't run the no-lazy-creation benchmark; updated it as well. Create() shouldn't show any difference with the flag on or off, as the change only takes effect in CreateViaFactory(). But all creation now does the extra work of initializing the std::once_flag, so compared to the old implementation Create() is slightly more expensive, though within the same magnitude. The numbers also differ slightly between runs.
| Benchmark | Base | Lazy | Delta | % Change | No Lazy | Delta | % Change |
|---|---|---|---|---|---|---|---|
| ipv4InstanceCreate | 650641 | 981351 | 330710 | 50.83% | 866243 | 215602 | 33.14% |
| ipv4InstanceCreateViaFactory | 863162 | 619751 | -243411 | -28.20% | 1049241 | 186079 | 21.56% |
| ipv6InstanceCreate | 2144933 | 2476094 | 331161 | 15.44% | 2424304 | 279371 | 13.02% |
| ipv6InstanceCreateViaFactory | 2512295 | 510851 | -2001444 | -79.67% | 2900675 | 388380 | 15.46% |
I dumped the data into a spreadsheet so I could compare them quantitatively. It looks like this is a 15%-30% regression in all cases when the flag is false ("No Lazy"), and a 15-50% regression in the Create() method when the flag is true. These both seem problematic?
The benchmarks do suggest a 30-80%(!!!!) improvement in the ViaFactory() methods, so that's great.
Am I interpreting the data correctly? If so, I'm not sure about how to balance the regression in the Create() (and flag-false) cases with the improvement in the ViaFactory() method.
There shouldn't be any difference in Create() between the lazy and no-lazy runs: both instantiate the std::once_flag and convert the socket address into a string. The regression difference between the two (50% vs. 33%) may just be noise between two runs of the benchmark. But the last column does show a 15-30% regression in Create() with the new implementation, even without lazy string conversion, due to the std::once_flag.
It could be that some of the degradation vs. the base case is because of the increased size of the objects due to adding the once_flag.
Yeah, that sounds possible. @danzh2010 can you run the test with more iterations to see if the numbers are consistent?
Old code
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               656954 ns    656911 ns         10000
ipv4InstanceCreateViaFactory     868333 ns    868294 ns         10000
ipv6InstanceCreate              2213465 ns   2213259 ns         10000
ipv6InstanceCreateViaFactory    2580065 ns   2579853 ns         10000
```
No Lazy
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               840328 ns    840277 ns         10000
ipv4InstanceCreateViaFactory    1007456 ns   1007379 ns         10000
ipv6InstanceCreate              2387824 ns   2387553 ns         10000
ipv6InstanceCreateViaFactory    2867409 ns   2867041 ns         10000
```
Lazy
```
-----------------------------------------------------------------------
Benchmark                             Time          CPU    Iterations
-----------------------------------------------------------------------
ipv4InstanceCreate               849292 ns    849270 ns         10000
ipv4InstanceCreateViaFactory     499743 ns    499724 ns         10000
ipv6InstanceCreate              2424963 ns   2424739 ns         10000
ipv6InstanceCreateViaFactory     498451 ns    498441 ns         10000
```
The Create() regression is 30% for v4 and 10% for v6. v6 address uses inet_ntop() which is more expensive.
I would say if 30% regression causes noticeable performance degradation, the call site should switch to use the Address factory which can then lead to 25% improvement.
Do we have an understanding of where those call sites are? I'm also concerned that this PR is a regression when the runtime flag is off. I wonder if a better approach would be to remove the addressAsString() method and replace it with a free function that does the stringification on the fly. This method is used in a lot of places, so it would be a big change, though it might not be too hard to search and replace?
I'm just feeling uncomfortable about landing a change which regresses the flag-off performance and regresses one API while improving the performance of another API without having an understanding of the impact this has on Envoy in general.
Why would removing the addressAsString() method and replacing it with a free function be helpful? If the goal is stringification on the fly, we can change the implementation of addressAsString() to do that.
And because there are so many different call sites, it's really hard to audit every one of them to make sure stringification on the fly won't cause a performance regression for them, let alone extensions out there.
I'm just feeling uncomfortable about landing a change which regresses the flag-off performance and regresses one API while improving the performance of another API without having an understanding of the impact this has on Envoy in general.
Would it be okay if I make these changes in a different Address::Instance impl and only use it for UDP read?
One thing to consider is that in the UDP path (which we're trying to optimize here), we create A LOT more addresses than in TCP. My guess is for a typical http/2 request we create 1-2 addresses total. And it looks like the total time taken to do that is very small, especially relative to overall processing of an HTTP request.
Creating a new implementation of the existing Address interfaces for UDP only is an interesting idea to allow us to isolate the change to only certain parts of the code. If we're doing that, I'd recommend making the new interface always generate the string on the fly and not need any storage for it; that'll reduce the size of the object.
These sound like great ideas! I'd definitely be curious about the perf impact on Envoy that we see from making addressToString() do the stringification on the fly on every call. That will for sure be a win if we never call this method, which seems like the common case. Do we have load tests which could tell us this?
A UDP specific (or really a non-caching) subclass seems like also a worthy idea to explore!
I think we probably want to avoid requiring synchronization here, if we can avoid it.
I also wonder if there is anything we could do in the UDP code to create fewer address objects. What if we had, say, a map from sockaddr_in to IpAddress for recent addresses. Then we could just return the same object? Maybe this would be hard.
I had the same thought, but it looks like if you remove the friendly string, the IP instances are just a struct sockaddr that can be memcpy'd, so it may not save anything to cache them. But it might be worth trying/investigating. The number of unique addresses we see should be in the range of 1-2 per connection usually. We might get a big win by just keeping the most recent addresses on a connection, and comparing to those on the incoming packet, and reusing if they match. Basically an LRU cache of size 1.
Another thought is to instantiate the factory-created Address instance with an implementation that does lazy stringification.
The address LRU cache will be much larger on listener side though.
Oh right. I was thinking of on the connection, but we need to have already created the address to match a packet to a connection, so that doesn't work.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!