Low throughput + NetEm delay creates gaps in upload data

Open upnix opened this issue 4 years ago • 1 comments

The problem: In Mininet, when limiting link speed to 10Mbps (via TBF or NetEm) and adding any amount of delay with NetEm, Flent using Netperf+TCP_STREAM will return large gaps in upload data - both in CSV output and resulting charts. While Netperf acts strangely in this scenario (which I'll describe below), I believe it is Flent and the use of apply_to in the DATA_SETS data structure that causes this problem.

The setup:

Mininet 2.3.0, installed from Github
Flent 2.0.1 installed with pip3 install flent
All on the same Ubuntu 20.04.4 install, directly on hardware (no VM).

With a network configuration of 1 router, 2 subnets, and 2 hosts (h1, h2), I use TBF to rate limit all links to 10Mbit/s, and NetEm to add ~28ms of delay between hosts (7ms on each link, but any amount of delay will do). I run Netserver on host h2, and the Flent test on h1, with traffic crossing the router. I'll attach my configuration files.

Commands:

$ sudo python3 ~/mininet_networks/1Router_2Networks_3Hosts.py
mininet> h2 pkill netserver
mininet> h2 netserver
mininet> h1 ethtool -K h1-eth0 tso off gso off gro off
mininet> h2 ethtool -K h2-eth0 tso off gso off gro off
mininet> h3 ethtool -K h3-eth0 tso off gso off gro off
mininet> r0 ethtool -K r0-eth1 tso off gso off gro off
mininet> r0 ethtool -K r0-eth2 tso off gso off gro off
mininet> r0 tc qdisc add dev r0-eth1 root tbf rate 10mbit burst 4096kbit latency 5ms
mininet> r0 tc qdisc add dev r0-eth2 root tbf rate 10mbit burst 4096kbit latency 5ms
mininet> r0 tc qdisc add dev r0-eth1 parent 8001: netem delay 7ms
mininet> r0 tc qdisc add dev r0-eth2 parent 8002: netem delay 7ms
mininet> h1 tc qdisc add dev h1-eth0 root tbf rate 10mbit burst 4096kbit latency 5ms
mininet> h1 tc qdisc add dev h1-eth0 parent 8005: netem delay 7ms
mininet> h2 tc qdisc add dev h2-eth0 root tbf rate 10mbit burst 4096kbit latency 5ms
mininet> h2 tc qdisc add dev h2-eth0 parent 8007: netem delay 7ms
mininet> h1 flent -H 10.0.0.100 -x --socket-stats -d 0 -l 60 tcp_2up -f csv -D ~chris/ -t 'TCP 2 Up ' -o ~chris/tcp_2up.csv

The result: There are large gaps in the results reported by Flent.

Narrowing the problem down: Above, I showed the problem with the Flent-included tcp_2up test, but because I believe the issue lies with the use of apply_to, I had to retool the test to exclude it. So I have two new test configurations:

tcp_nup_2.conf - This is the Flent-included tcp_nup.conf, modified by commenting out the function add_stream, the call to for_stream_config() and the DATA_SETS entry "TCP upload avg". I then hard-code in what is essentially a single "TCP upload::1" test.
tcp_1up_from_nup_2.conf - This is tcp_2up.conf, but it includes tcp_nup_2.conf instead of tcp_nup.conf

Now, running the Flent test tcp_1up_from_nup_2.conf, upload data is shown as continuous, as you'd expect.

Why? I don't know. What I do know is that the Flent test tcp_2down has no problems, and when I run the related Netperf command directly, TCP_MAERTS returns results with the expected regularity (NETPERF_INTERVAL[xx]=0.2, more or less). However, the Netperf test TCP_STREAM, which tcp_2up uses, leaves about 4 seconds between results (NETPERF_INTERVAL[xx]=4, more or less). The results returned still seem accurate to me; there are just longer pauses between reports.
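To check how far apart the reports actually are, the NETPERF_INTERVAL lines can be pulled out of the Netperf output and compared against a limit. This is only a rough sketch: the function name `long_intervals`, the 1-second limit, and the sample text are made up for illustration, and the sample is not real output from my runs.

```python
import re

# Matches lines such as "NETPERF_INTERVAL[12]=0.2".
INTERVAL_RE = re.compile(r"^NETPERF_INTERVAL\[(\d+)\]=([0-9.]+)$")

def long_intervals(output, limit=1.0):
    """Return (index, seconds) for every reporting interval longer than `limit`."""
    found = []
    for line in output.splitlines():
        match = INTERVAL_RE.match(line.strip())
        if match and float(match.group(2)) > limit:
            found.append((int(match.group(1)), float(match.group(2))))
    return found

# Illustrative input only (hypothetical values, not captured from a real run).
sample = """NETPERF_INTERIM_RESULT[0]=9.4
NETPERF_UNITS[0]=10^6bits/s
NETPERF_INTERVAL[0]=0.2
NETPERF_INTERIM_RESULT[1]=9.5
NETPERF_UNITS[1]=10^6bits/s
NETPERF_INTERVAL[1]=4.0
"""

print(long_intervals(sample))
```

Saving the Netperf output to a file and passing its contents to `long_intervals` would list every report that arrived late.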

But this can't be the entire story, because Flent tests that don't use apply_to when building DATA_SETS use the exact same Netperf command, gaps and all, yet don't have this problem.

So it would seem to me that somehow Flent isn't properly handling gaps in reporting when apply_to is used for DATA_SETS.

What else fixes the problem?

Removing any delay on the link, whether from removing NetEm or going to a hardware switch.
Increasing the link speed, but keeping the delay.
Changing the TCP CCA to BBR, rather than CUBIC.

Note that these are probably things that just make Netperf return results every 0.2 seconds (I haven't checked though), so they're probably not directly related to Flent.

Files of interest:

Flent results when running the included tcp_2up test: tcp_2up-2022-04-22T095700.876743.TCP_2_Up.flent.gz

My Flent test that avoids gaps in upload data: tcp_1up_from_nup_2.txt tcp_nup_2.txt

The Mininet network used: 1Router_2Networks_3Hosts.txt

Apr 22 '22 16:04 upnix

Chris Cameron @.***> writes:

The problem: In Mininet, when limiting link speed to 10Mbps (via TBF or NetEm) and adding any amount of delay with NetEm, Flent using Netperf+TCP_STREAM will return large gaps in upload data - both in CSV output and resulting charts. While Netperf acts strangely in this scenario (which I'll describe below), I believe it is Flent and the use of apply_to in the DATA_SETS data structure that causes this problem.

So you're kinda right that the problem is caused by an interaction between netperf's behaviour and the Flent series computation (for certain series). Specifically, this is what happens:

At really low bandwidths, netperf will miss its data point output deadline, which causes data points to be spread out - by a lot, as you've noticed. This is most pronounced for upstream (TCP_STREAM) netperf flows.
Flent can plot these "sparse" data points just fine; however, the synthetic (computed after the fact) data series, i.e., the "average" and "total" bandwidth series, suffer.

The reason for the latter is the way Flent computes the synthetic data: it will try to generate a synthetic data point at every 'step size' interval, by linearly interpolating the points on both sides. E.g., if netperf outputs data points at t=0.198 and t=0.398, it'll interpolate between those to generate a synthetic data point at t=0.2. This will happen for each series, and the sum or average computation is done on those synthetic data points that are all aligned to the step size intervals.

The problem you're seeing happens because there's a maximum interpolation distance (of five times the step size), and if the data points are further apart than this, no interpolation will be done and you'll get gaps in the synthetic series.
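The behaviour described above can be sketched roughly as follows. This is not Flent's actual code, just an illustration of the idea under stated assumptions: the function name `resample`, the parameter `max_gap_steps`, and the use of `None` to mark a gap are all hypothetical.

```python
from bisect import bisect_left

def resample(points, step, max_gap_steps=5):
    """Align (time, value) points to multiples of `step` by linear interpolation.

    `points` must be sorted by time. A grid point gets None (a gap) when the
    two surrounding samples are more than max_gap_steps * step apart.
    """
    times = [t for t, _ in points]
    out = []
    k = 0
    while k * step <= times[-1]:
        g = k * step
        k += 1
        i = bisect_left(times, g)  # index of first sample with time >= g
        if times[i] == g:
            out.append((g, points[i][1]))  # sample falls exactly on the grid
        elif i == 0:
            out.append((g, None))  # nothing to the left to interpolate from
        else:
            (t0, v0), (t1, v1) = points[i - 1], points[i]
            if t1 - t0 > max_gap_steps * step:
                out.append((g, None))  # samples too far apart: leave a gap
            else:
                out.append((g, v0 + (v1 - v0) * (g - t0) / (t1 - t0)))
    return out

# Samples at t=0.198 and t=0.398 yield an interpolated point at t=0.2.
print(resample([(0.198, 10.0), (0.398, 12.0)], step=0.2))

# Samples 4 s apart with a 0.2 s step exceed the 1 s limit: gaps in between.
print(resample([(0.0, 9.0), (4.0, 9.0)], step=0.2)[1:4])
```

A sum or average series would then be computed only at grid points where every input series has a value, which is why a single sparse series is enough to punch holes in the synthetic one.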

Now, as for the question about what can be done about it, I'm afraid that (in my opinion) the answer turns out to be "not much". The fundamental problem here is that we're trying to compute a value that's not really well-defined, because we're combining several time series whose data points don't line up in time.

I.e., as an example, if there are two instances of netperf running, series A outputs data points at t=1, 4, and 7 seconds, and series B outputs data points at t=3, 6 and 9 seconds, how are you really going to tell what the average throughput at t=2 seconds was?

(That's a serious question, BTW: if you have an idea for a better algorithm for interpolating data points, or just computing the synthetic series in a different way, I'm all ears.)

As a workaround you could try increasing the step size; this should make the timing error in netperf's data output relatively smaller (since that error tends to stay roughly constant in absolute terms), which may help get rid of the gaps...
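As a back-of-the-envelope check of that workaround, the arithmetic below relates the step size to the largest gap that can still be interpolated. It assumes the limit is exactly five times the step size and that the netperf gaps stay around the 4 seconds reported above even after the step size changes, which may not hold in practice; the function names are hypothetical.

```python
MAX_GAP_STEPS = 5  # maximum interpolation distance, in multiples of the step size

def max_gap(step):
    """Largest spacing between data points (seconds) that still gets interpolated."""
    return MAX_GAP_STEPS * step

def min_step_for_gap(gap):
    """Smallest step size (seconds) for which a spacing of `gap` is still bridged."""
    return gap / MAX_GAP_STEPS

print(max_gap(0.2))           # limit with a 0.2 s step
print(min_step_for_gap(4.0))  # step needed to bridge 4 s gaps
```

Under those assumptions, a 0.2 s step only bridges gaps of up to about 1 s, while gaps of about 4 s would need a step size of roughly 0.8 s or more.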

Apr 25 '22 21:04 tohojo