
Add integration test for memory performance guest vs host

Open goandrei opened this issue 5 years ago • 8 comments

Add a test case which will compare the memory performance of the guest vs the host. Both guest and host will be tested with the same shape for accurate results.

goandrei avatar Mar 04 '20 08:03 goandrei

I came across this memory bandwidth test: http://www.cs.virginia.edu/stream/

I don't know if it makes sense to use it in Firecracker, but it's worth investigating.
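For context, STREAM measures sustained memory bandwidth with four simple kernels. A toy Python sketch of what they compute (the real benchmark is C with OpenMP and careful timing over large arrays; this just illustrates the four operations):

```python
import time

def stream_kernels(n=1_000_000, scalar=3.0):
    """Toy versions of the four STREAM kernels. The real benchmark
    times each kernel separately over large arrays and reports MB/s."""
    a = [1.0] * n
    b = [2.0] * n
    c = [0.0] * n

    t0 = time.perf_counter()
    c = a[:]                                    # Copy:  c[i] = a[i]
    b = [scalar * x for x in c]                 # Scale: b[i] = s*c[i]
    c = [x + y for x, y in zip(a, b)]           # Add:   c[i] = a[i] + b[i]
    a = [x + scalar * y for x, y in zip(b, c)]  # Triad: a[i] = b[i] + s*c[i]
    elapsed = time.perf_counter() - t0
    return a, b, c, elapsed
```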

serban300 avatar Mar 05 '20 10:03 serban300

I am trying to understand how stable the results are between runs (both on the host and in the guest).

Here is the command that I used for compiling stream.c: gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=17500000 -DNTIMES=10 stream.c -o stream
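A note on the -DSTREAM_ARRAY_SIZE choice: STREAM's own guidance is that each array should be at least ~4x the last-level cache, so that the runs measure memory rather than cache. With three arrays of 17,500,000 doubles the working set is about 400 MiB:

```python
# Footprint implied by the gcc command above
ARRAY_SIZE = 17_500_000   # -DSTREAM_ARRAY_SIZE
BYTES_PER_DOUBLE = 8
NUM_ARRAYS = 3            # STREAM allocates a, b and c

footprint_mib = NUM_ARRAYS * ARRAY_SIZE * BYTES_PER_DOUBLE / 2**20
print(f"{footprint_mib:.1f} MiB")  # about 400 MiB, well beyond typical caches
```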

I wrote a small script for getting the variation between runs:

#!/bin/bash
# Repeatedly runs ./stream and tracks the min, max and running mean
# bandwidth (MB/s) for each STREAM kernel, plus the min-max spread in %.

NUM_OPS=4
NUM_RUNS=1000000000

declare -a MIN
declare -a MAX
for ((i=0; i<$NUM_OPS; i++))
do
    MIN[$i]=1000000
    MAX[$i]=0
done

declare -a OPS
OPS[0]="Copy"
OPS[1]="Scale"
OPS[2]="Add"
OPS[3]="Triad"

declare -a RESULTS
declare -a MEAN
for ((i=0; i<$NUM_OPS; i++))
do
    MEAN[$i]=0
done


for ((run=1; run<=$NUM_RUNS; run++))
do
    STREAM=$(./stream)
    for ((i=0; i<$NUM_OPS; i++))
    do
        # Extract the bandwidth value reported for this kernel
        RESULTS[$i]=$(echo $STREAM | grep -Po "${OPS[$i]}:( +)\K[^ ]+")
        # Incremental mean: mean += (x - mean) / n
        MEAN[$i]=$(echo "${MEAN[$i]} + (${RESULTS[$i]} - ${MEAN[$i]})/$run" | bc -l)
    done

    for ((i=0; i<$NUM_OPS; i++))
    do
        if (( $(echo "${RESULTS[$i]} < ${MIN[$i]}" | bc -l) )); then
            MIN[$i]=${RESULTS[$i]}
        fi
        if (( $(echo "${RESULTS[$i]} > ${MAX[$i]}" | bc -l) )); then
            MAX[$i]=${RESULTS[$i]}
        fi
    done

    echo "RUN: $run"
    for ((i=0; i<$NUM_OPS; i++))
    do
        # Spread between best and worst run, as a percentage of the max
        DIFF=$(echo "(${MAX[$i]} - ${MIN[$i]})" | bc -l)
        DIFF=$(echo "(100 * $DIFF) / ${MAX[$i]}" | bc -l)
        printf "${OPS[$i]}:\t%.2f    %.2f    %.2f    %.2f%s\n" ${MIN[$i]} ${MAX[$i]} ${MEAN[$i]} $DIFF '%'
    done
    echo ""

    sleep 60
done
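The MEAN update in the loop above is the standard incremental (streaming) mean, which avoids storing all samples. A quick Python check that it matches the plain arithmetic mean:

```python
def running_mean(samples):
    """Incremental mean, matching MEAN[i] += (x - MEAN[i]) / run above."""
    mean = 0.0
    for run, x in enumerate(samples, start=1):
        mean += (x - mean) / run
    return mean

data = [88658.0, 93153.5, 90460.8]
assert abs(running_mean(data) - sum(data) / len(data)) < 1e-9
```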

serban300 avatar Apr 15 '20 15:04 serban300

I tried to run the script above on a set of 16 CPUs that are shielded from the Linux scheduler.

Here is what I did to shield the CPUs:

sudo cset shield --cpu 72-87
sudo cset shield --kthread on
sudo cset set --mem=1 --set=user
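To sanity-check that a benchmark really runs only on the shielded set, one can compare Cpus_allowed_list from /proc/&lt;pid&gt;/status against the shielded range. A small sketch of parsing that list format (parse_cpu_list is an illustrative helper, not part of cset):

```python
def parse_cpu_list(spec):
    """Parse a CPU list such as '72-87' or '0-3,8' (the format used by
    cset and by Cpus_allowed_list in /proc/<pid>/status) into a set of
    CPU ids."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# The shield command above isolates CPUs 72-87, i.e. 16 CPUs
assert len(parse_cpu_list("72-87")) == 16
```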

Here are some results:

Host:

~# sudo cset shield --exec perf stat -- ./run_stream.sh
...
RUN: 50
Copy:	88658.00    93153.50    90460.80    4.83%
Scale:	68651.50    72150.90    70440.51    4.85%
Add:	77167.90    83011.20    80723.53    7.04%
Triad:	77112.30    81863.60    79499.42    5.80%


 Performance counter stats for './run_stream.sh':

    1432660.656054      task-clock (msec)         #    9.892 CPUs utilized          
              8511      context-switches          #    0.006 K/sec                  
               279      cpu-migrations            #    0.000 K/sec                  
          10425314      page-faults               #    0.007 M/sec                  
     4532611564056      cycles                    #    3.164 GHz                    
      992421204712      instructions              #    0.22  insn per cycle         
      138884916514      branches                  #   96.942 M/sec                  
          54009228      branch-misses             #    0.04% of all branches        

     144.832597192 seconds time elapsed

Guest:

~# ./perf/perf stat ./run_stream.sh
...
RUN: 50
Copy:	87515.40    91027.70    89640.03    3.86%
Scale:	70375.10    72453.40    71566.65    2.87%
Add:	79145.10    82926.70    81279.99    4.56%
Triad:	79134.20    81272.60    80376.67    2.63%


 Performance counter stats for './run_stream.sh':

    1408020.226366      task-clock (msec)         #    9.678 CPUs utilized          
             10110      context-switches          #    0.007 K/sec                  
               520      cpu-migrations            #    0.000 K/sec                  
           5419113      page-faults               #    0.004 M/sec                  
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               

     145.483396391 seconds time elapsed

perf kvm stat, run from the host while the guest was running the test:

~# sudo cset shield --exec perf kvm -- --host --guest stat ./start_firecracker.sh
...
 Performance counter stats for './start_firecracker.sh':

    1416306.255328      task-clock:HG (msec)      #    8.147 CPUs utilized          
             75640      context-switches:HG       #    0.053 K/sec                  
               480      cpu-migrations:HG         #    0.000 K/sec                  
             42390      page-faults:HG            #    0.030 K/sec                  
     4467166928992      cycles:HG                 #    3.154 GHz                    
     1024499397546      instructions:HG           #    0.23  insn per cycle         
      147463380118      branches:HG               #  104.118 M/sec                  
         149773583      branch-misses:HG          #    0.10% of all branches        

     173.843728512 seconds time elapsed

It's strange that in some cases the results on the guest are better than the ones on the host. The same behavior can be observed for a single vCPU.

One thing that stands out from the perf results is that the number of page-faults is far smaller on the guest than on the host.

serban300 avatar Apr 16 '20 14:04 serban300

I recompiled the Stream binary and now I'm getting a similar number of page faults on host vs guest. This doesn't seem to be the issue.

serban300 avatar Apr 21 '20 09:04 serban300

I ran the script on 8 CPUs (host) vs 8 vCPUs (guest) overnight. I used CPU shielding for 8 CPUs belonging to the same NUMA node. Here is the final iteration:

Host:

RUN: 1302
Copy:	69188.50    71501.10    70052.59    3.23%
Scale:	57506.90    62428.50    60140.94    7.88%
Add:	69420.20    73645.80    71647.42    5.74%
Triad:	69595.80    72214.80    70891.60    3.63%

Guest:

RUN: 1372
Copy:	68176.30    70458.70    69288.33    3.24%
Scale:	58626.50    62708.50    61264.49    6.51%
Add:	69023.10    73156.50    71686.87    5.65%
Triad:	68268.80    71709.20    70581.14    4.80%

For Scale and Add the results were better on the guest. We need to understand why this is happening and whether it's normal.
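For scale, the guest advantage here is small compared to the run-to-run spread. Computing the relative deltas from the means quoted above:

```python
# Means quoted above (MB/s); guest is ahead on Scale and Add,
# behind on Copy and Triad, all within a couple of percent.
host_mean = {"Copy": 70052.59, "Scale": 60140.94, "Add": 71647.42, "Triad": 70891.60}
guest_mean = {"Copy": 69288.33, "Scale": 61264.49, "Add": 71686.87, "Triad": 70581.14}

deltas = {
    op: 100 * (guest_mean[op] - host_mean[op]) / host_mean[op]
    for op in host_mean
}
for op, d in deltas.items():
    print(f"{op}: guest vs host {d:+.2f}%")
```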

serban300 avatar Apr 23 '20 09:04 serban300

I reran the script with NTP synchronization in the guest. I also modified the script to sync the clock before starting the runs:

sudo service ntp stop
sudo ntpd -gq
sudo service ntp start

Now the averages on the host are always better than the averages on the guest:

Host:

RUN: 1339
Copy:	69598.50    71120.00    70362.82    2.14%
Scale:	59584.20    61068.30    60216.76    2.43%
Add:	70718.90    72449.40    71433.56    2.39%
Triad:	69513.40    71104.20    70147.77    2.24%

Guest:

RUN: 1339
Copy:	66921.50    70172.40    68823.63    4.63%
Scale:	57015.50    59134.20    58286.53    3.58%
Add:	68414.60    71089.90    69954.42    3.76%
Triad:	68160.50    70495.30    69455.27    3.31%

But the max on the guest can still be better than the min on the host. We need to understand how to account for this, and how many runs we need in order to get statistically relevant results. We also need to understand whether the variations can grow even larger. I will leave the script running for more time and see when the variations stop increasing.
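On how many runs are needed: one common approach is confidence-interval sizing, assuming independent and roughly normal run-to-run results (a sketch of the idea, not something this thread settled on):

```python
import math
import statistics

def runs_needed(samples, rel_error=0.005, z=1.96):
    """Runs required for a 95% CI half-width within rel_error of the
    mean, assuming independent, roughly normal run-to-run results."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    half_width = rel_error * mean  # target: mean known to +/- 0.5%
    return math.ceil((z * stdev / half_width) ** 2)

# With a spread like the ~2% host variations above, a handful of runs
# pins the mean to 0.5%; halving the target quadruples the run count.
print(runs_needed([70362.8, 69598.5, 71120.0, 70500.0, 69900.0]))
```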

serban300 avatar Apr 27 '20 11:04 serban300

Why not use 1588/PTP? It would give much better synchronization.

raduiliescu avatar Apr 27 '20 12:04 raduiliescu

I'll also try PTP

serban300 avatar Apr 28 '20 15:04 serban300

We've discussed this with the team today, and came to the conclusion that we do not see much value in such a test. Outside of reads/writes of the MMIO region, Firecracker is not included in any memory access paths, so at best such a test would exercise KVM paths (e.g. setting up extended page tables, dirty page tracking), and at worst we would only be profiling the hardware MMU. Such benchmarks are out of scope for Firecracker.

roypat avatar Mar 04 '24 11:03 roypat