Add integration test for memory performance guest vs host
Add a test case that compares the memory performance of the guest vs. the host. Both guest and host will be tested with the same shape for accurate results.
I came across this memory bandwidth test: http://www.cs.virginia.edu/stream/
I don't know if it makes sense to use it in Firecracker, but it's worth investigating.
I am trying to understand how stable the results are between runs (on both host and guest).
Here is the command that I used for compiling stream.c:
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=17500000 -DNTIMES=10 stream.c -o stream
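For context on the array size: STREAM allocates three arrays of STREAM_ARRAY_SIZE doubles, and its own guidance is that each array should be several times larger than the aggregate last-level cache so that runs actually measure memory bandwidth rather than cache bandwidth. A quick back-of-the-envelope check of the footprint implied by the flags above:

```shell
# STREAM allocates three double-precision arrays of STREAM_ARRAY_SIZE elements.
ARRAY_SIZE=17500000
BYTES_PER_DOUBLE=8
NUM_ARRAYS=3

TOTAL_BYTES=$((NUM_ARRAYS * ARRAY_SIZE * BYTES_PER_DOUBLE))
echo "working set: $((TOTAL_BYTES / 1000000)) MB"
```

At ~420 MB, the working set is far larger than any last-level cache on the machines in question.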
I wrote a small script for getting the variation between runs:
#!/bin/bash
# Track the min/max/mean bandwidth of each STREAM kernel across repeated runs.

NUM_OPS=4
NUM_RUNS=1000000000

declare -a OPS=("Copy" "Scale" "Add" "Triad")
declare -a MIN MAX MEAN RESULTS

for ((i = 0; i < NUM_OPS; i++)); do
    MIN[$i]=1000000
    MAX[$i]=0
    MEAN[$i]=0
done

for ((run = 1; run <= NUM_RUNS; run++)); do
    STREAM=$(./stream)

    for ((i = 0; i < NUM_OPS; i++)); do
        # Extract the "Best Rate MB/s" column for this kernel.
        RESULTS[$i]=$(echo "$STREAM" | grep -Po "${OPS[$i]}:( +)\K[^ ]+")
        # Incremental update of the running mean.
        MEAN[$i]=$(echo "${MEAN[$i]} + (${RESULTS[$i]} - ${MEAN[$i]}) / $run" | bc -l)

        if (( $(echo "${RESULTS[$i]} < ${MIN[$i]}" | bc -l) )); then
            MIN[$i]=${RESULTS[$i]}
        fi
        if (( $(echo "${RESULTS[$i]} > ${MAX[$i]}" | bc -l) )); then
            MAX[$i]=${RESULTS[$i]}
        fi
    done

    echo "RUN: $run"
    for ((i = 0; i < NUM_OPS; i++)); do
        # Spread between max and min, as a percentage of max.
        DIFF=$(echo "${MAX[$i]} - ${MIN[$i]}" | bc -l)
        DIFF=$(echo "(100 * $DIFF) / ${MAX[$i]}" | bc -l)
        printf "${OPS[$i]}:\t%.2f %.2f %.2f %.2f%s\n" \
            "${MIN[$i]}" "${MAX[$i]}" "${MEAN[$i]}" "$DIFF" '%'
    done
    echo ""

    sleep 60
done
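The grep -Po extraction above pulls the "Best Rate MB/s" column out of STREAM's report: \K discards everything matched up to that point, leaving only the first number after the kernel name. A minimal sketch of that step on a fabricated output line (the numbers are made up for illustration):

```shell
# A fabricated line in the shape STREAM prints:
# Function  Best Rate MB/s  Avg time  Min time  Max time
LINE="Copy:           88658.0     0.003175     0.003158     0.003196"

# Match "Copy:" plus the following spaces, drop them with \K,
# then capture the run of non-space characters (the MB/s figure).
echo "$LINE" | grep -Po "Copy:( +)\K[^ ]+"
```

This prints `88658.0`; the same pattern is applied per kernel name in the loop.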
I tried to run the script above on a set of 16 CPUs that are shielded from the Linux scheduler. Here is what I did for shielding the CPUs:
sudo cset shield --cpu 72-87
sudo cset shield --kthread on
sudo cset set --mem=1 --set=user
Here are some results (each row shows min, max, and running mean in MB/s, plus the max-min spread as a percentage of max):
Host:
~# sudo cset shield --exec perf stat -- ./run_stream.sh
...
RUN: 50
Copy: 88658.00 93153.50 90460.80 4.83%
Scale: 68651.50 72150.90 70440.51 4.85%
Add: 77167.90 83011.20 80723.53 7.04%
Triad: 77112.30 81863.60 79499.42 5.80%
Performance counter stats for './run_stream.sh':
1432660.656054 task-clock (msec) # 9.892 CPUs utilized
8511 context-switches # 0.006 K/sec
279 cpu-migrations # 0.000 K/sec
10425314 page-faults # 0.007 M/sec
4532611564056 cycles # 3.164 GHz
992421204712 instructions # 0.22 insn per cycle
138884916514 branches # 96.942 M/sec
54009228 branch-misses # 0.04% of all branches
144.832597192 seconds time elapsed
Guest:
~# ./perf/perf stat ./run_stream.sh
...
RUN: 50
Copy: 87515.40 91027.70 89640.03 3.86%
Scale: 70375.10 72453.40 71566.65 2.87%
Add: 79145.10 82926.70 81279.99 4.56%
Triad: 79134.20 81272.60 80376.67 2.63%
Performance counter stats for './run_stream.sh':
1408020.226366 task-clock (msec) # 9.678 CPUs utilized
10110 context-switches # 0.007 K/sec
520 cpu-migrations # 0.000 K/sec
5419113 page-faults # 0.004 M/sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
145.483396391 seconds time elapsed
perf kvm stat, run from the host while the guest was running the test:
~# sudo cset shield --exec perf kvm -- --host --guest stat ./start_firecracker.sh
...
Performance counter stats for './start_firecracker.sh':
1416306.255328 task-clock:HG (msec) # 8.147 CPUs utilized
75640 context-switches:HG # 0.053 K/sec
480 cpu-migrations:HG # 0.000 K/sec
42390 page-faults:HG # 0.030 K/sec
4467166928992 cycles:HG # 3.154 GHz
1024499397546 instructions:HG # 0.23 insn per cycle
147463380118 branches:HG # 104.118 M/sec
149773583 branch-misses:HG # 0.10% of all branches
173.843728512 seconds time elapsed
It's strange that in some cases the results on the guest are better than the ones on the host. The same behavior can be observed with a single vCPU.
One thing that stands out from the perf results is that the number of page faults is far smaller on the guest than on the host.
I recompiled the Stream binary and now I'm getting a similar number of page faults on host vs guest. This doesn't seem to be the issue.
I ran the script on 8 CPUs (host) vs 8 vCPUs (guest) overnight. I used CPU shielding for 8 CPUs belonging to the same NUMA node. Here is the final iteration:
Host:
RUN: 1302
Copy: 69188.50 71501.10 70052.59 3.23%
Scale: 57506.90 62428.50 60140.94 7.88%
Add: 69420.20 73645.80 71647.42 5.74%
Triad: 69595.80 72214.80 70891.60 3.63%
Guest:
RUN: 1372
Copy: 68176.30 70458.70 69288.33 3.24%
Scale: 58626.50 62708.50 61264.49 6.51%
Add: 69023.10 73156.50 71686.87 5.65%
Triad: 68268.80 71709.20 70581.14 4.80%
For Scale and Add the results were better on the guest. We need to understand why this is happening and if it's normal.
I reran the script with NTP synchronization in the guest. I also modified the script to sync the clock before starting the runs:
sudo service ntp stop
sudo ntpd -gq
sudo service ntp start
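Clock accuracy matters here because STREAM derives MB/s from wall-clock time, so a guest clock that under-counts elapsed time makes every run look proportionally faster. A quick illustration with awk (all numbers below are hypothetical, chosen only to show the effect):

```shell
# STREAM reports bandwidth as bytes_moved / elapsed_seconds.
# A clock that under-counts elapsed time by 1% inflates bandwidth by ~1%.
awk 'BEGIN {
    bytes = 280000000        # hypothetical bytes moved by one kernel iteration
    real_s = 0.004           # true elapsed time in seconds
    skewed_s = real_s * 0.99 # guest clock running 1% slow

    printf "true:   %.1f MB/s\n", bytes / real_s / 1e6
    printf "skewed: %.1f MB/s\n", bytes / skewed_s / 1e6
}'
```

A 1% timing error is on the same order as the guest-beats-host gaps observed above, which is why syncing the clock before the runs changes the ranking.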
Now the averages on the host are always better than the averages on the guest:
Host:
RUN: 1339
Copy: 69598.50 71120.00 70362.82 2.14%
Scale: 59584.20 61068.30 60216.76 2.43%
Add: 70718.90 72449.40 71433.56 2.39%
Triad: 69513.40 71104.20 70147.77 2.24%
Guest:
RUN: 1339
Copy: 66921.50 70172.40 68823.63 4.63%
Scale: 57015.50 59134.20 58286.53 3.58%
Add: 68414.60 71089.90 69954.42 3.76%
Triad: 68160.50 70495.30 69455.27 3.31%
But the max on the guest can still be better than the min on the host. We need to understand how to account for this and how many runs are needed to get statistically relevant results. We also need to understand whether the variations can get even bigger. I will leave the script running for longer and see when the variations stop increasing.
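One way to put a number on "how many runs" is the standard sample-size formula for a confidence interval on the mean, n ≈ (z·s/E)², assuming roughly normal, independent samples. A sketch in awk; the standard deviation and target margin below are illustrative guesses, not measured values:

```shell
# n ~= (z * s / E)^2 : runs needed so that the confidence interval on the
# mean has half-width at most E.
awk 'BEGIN {
    z = 1.96    # z-score for 95% confidence
    s = 500     # guessed per-run standard deviation, in MB/s
    e = 100     # desired half-width of the interval, in MB/s

    n = (z * s / e) ^ 2
    # Round up to a whole number of runs.
    printf "runs needed: %d\n", (n == int(n)) ? n : int(n) + 1
}'
```

With these example figures the formula gives 97 runs; plugging in the per-kernel standard deviation actually observed overnight would give a defensible run count instead of a guess.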
Why not use 1588/PTP? It will give much better synchronization.
I'll also try PTP
We've discussed this with the team today, and came to the conclusion that we do not see much value in such a test. Outside of reads/writes of the MMIO region, Firecracker is not involved in any memory access paths, so at best such a test would exercise KVM paths (e.g. setting up extended page tables, dirty page tracking), and at worst we would only be profiling the hardware MMU. Such benchmarks are out of scope for Firecracker.