mod_h2 icon indicating copy to clipboard operation
mod_h2 copied to clipboard

wrong values logged for "Bytes sent" on apache 2.4 with http/2

Open vl-grigorov opened this issue 5 years ago • 20 comments

I have noticed wrong values logged for %O (Bytes sent) on interrupted connections on apache-2.4.43 when connection is over http/2. SSL makes no difference. But on plain-text connections "%s", "%D" and "%A" are wrong too. (all variables are from apache mod_log_config)

Here are the requests:

user@userpc:~/homedir$ timeout -s KILL 5 curl --http2 https://s801.sureserver.com/testXXX.bin --output /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 29  100M   29 29.7M    0     0  6771k      0  0:00:15  0:00:04  0:00:11 6770kKilled
user@userpc:~/homedir$ timeout -s KILL 5 curl --http1.1 https://s801.sureserver.com/testXXX.bin --output /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 29  100M   29 29.0M    0     0  6371k      0  0:00:16  0:00:04  0:00:12 6370kKilled
user@userpc:~/homedir$ timeout -s KILL 5 curl --http1.0 https://s801.sureserver.com/testXXX.bin --output /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 33  100M   33 33.4M    0     0  7087k      0  0:00:14  0:00:04  0:00:10 7085kKilled
user@userpc:~/homedir$ timeout -s KILL 5 curl --http2 http://s801.sureserver.com/testXXX.bin --output /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 31  100M   31 31.0M    0     0  7348k      0  0:00:13  0:00:04  0:00:09 7346kKilled
user@userpc:~/homedir$ timeout -s KILL 5 curl --http1.1 http://s801.sureserver.com/testXXX.bin --output /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 35  100M   35 35.7M    0     0  8123k      0  0:00:12  0:00:04  0:00:08 8121kKilled
user@userpc:~/homedir$ timeout -s KILL 5 curl --http1.0 http://s801.sureserver.com/testXXX.bin --output /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 36  100M   36 36.3M    0     0  8152k      0  0:00:12  0:00:04  0:00:08 8150kKilled
user@userpc:~/homedir$ timeout -s KILL 5 curl --http2 -H "Range: bytes=0-52428800" https://s801.sureserver.com/testXXX.bin --output /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 69 50.0M   69 34.8M    0     0  7951k      0  0:00:06  0:00:04  0:00:02 7950kKilled
user@userpc:~/homedir$ timeout -s KILL 5 curl --http1.1 -H "Range: bytes=0-52428800" https://s801.sureserver.com/testXXX.bin --output /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 71 50.0M   71 35.6M    0     0  8060k      0  0:00:06  0:00:04  0:00:02 8058kKilled

Here is the LogFormat definition:

LogFormat "%V %a %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" 0 0 \"%{HTTPS}x:%{SSL_PROTOCOL}x:%{SSL_CIPHER}x\" %I %D %A" combined

And that's what I get logged:

s801.sureserver.com XXX.XXX.XXX.XXX - - [22/Jul/2020:11:25:32 +0300] "GET /testXXX.bin HTTP/2.0" 200 104857600 "-" "curl/7.62.0" 0 0 "on:TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384" 52 4977555
s801.sureserver.com XXX.XXX.XXX.XXX - - [22/Jul/2020:11:25:41 +0300] "GET /testXXX.bin HTTP/1.1" 200 33319484 "-" "curl/7.62.0" 0 0 "on:TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384" 764 4747961
s801.sureserver.com XXX.XXX.XXX.XXX - - [22/Jul/2020:11:26:20 +0300] "GET /testXXX.bin HTTP/1.0" 200 37307283 "-" "curl/7.62.0" 0 0 "on:TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384" 764 4762581
s801.sureserver.com XXX.XXX.XXX.XXX - - [01/Jan/1970:02:00:00 +0200] "GET /testXXX.bin HTTP/2.0" 200 104857600 "-" "curl/7.62.0" 0 0 "off:-:-" 0 1595406571434452
s801.sureserver.com XXX.XXX.XXX.XXX - - [22/Jul/2020:11:54:30 +0300] "GET /testXXX.bin HTTP/1.1" 200 43146056 "-" "curl/7.62.0" 0 0 "off:-:-" 92 4969537
s801.sureserver.com XXX.XXX.XXX.XXX - - [22/Jul/2020:11:54:48 +0300] "GET /testXXX.bin HTTP/1.0" 200 43091032 "-" "curl/7.62.0" 0 0 "off:-:-" 92 4993841
s801.sureserver.com XXX.XXX.XXX.XXX - - [22/Jul/2020:12:21:41 +0300] "GET /testXXX.bin HTTP/2.0" 206 52428801 "-" "curl/7.62.0" 0 0 "on:TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384" 66 4977392
s801.sureserver.com XXX.XXX.XXX.XXX - - [22/Jul/2020:12:27:26 +0300] "GET /testXXX.bin HTTP/1.1" 206 42318859 "-" "curl/7.62.0" 0 0 "on:TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384" 789 4973665

As you can see, the logged Bytes sent value match the requested size (last 2 attempts are partial download and on http/2 request logged value is the chunk size in the request).

vl-grigorov avatar Jul 22 '20 10:07 vl-grigorov

We've noticed the same issue. Can you please look into this?

famzah avatar Aug 11 '20 13:08 famzah

Sorry. I am a bit busy these days with putting bread on the table. But I can tell you that due to the architecture of h2 in Apache, these lengths have a good chance to not be perfectly aligned with what the client sees.

The response body is streamed from the h2 workers to the main connection where it is buffered and dispatched based on bandwidth and h2 priorities. The counting is done on the worker thread and will never account for the main connection handling.

Makes sense?

icing avatar Aug 13 '20 08:08 icing

Thanks, @icing. This makes sense but only partially. We can accept a small difference in the numbers due to buffering, but the real issue is:

  • we have a 100 MBytes static file
  • if the client requests the whole file and downloads only 29% from it (for example), we always log 100% of the file size (downloaded 29 MB, left 71 MB, but we logged 100 MB)
  • if the client requests 50% of the file with an HTTP range request, downloads only 71% (for example), we always log 100% of the requested file size (downloaded 36 MB, left 14 MB, but we logged 50 MB)

Thus we have two examples where the main connection streamed some data to the client but in case one we have 71 MB left and in the other case we have 14 MB left. But we logged the whole requested file size in the Apache access log.

Are the buffers between "main" and "h2" workers so huge that the h2 workers can buffer 71 MBytes of data to the main connection? Are these buffers configurable, so that we can try to lower them a lot?

famzah avatar Aug 13 '20 08:08 famzah

Well, it does not buffer 70 MB, but for static files, it buffers the file handle and counts its file size. ;)

icing avatar Aug 13 '20 11:08 icing

The file handle is passed by reference (value) from an "h2" worker to the "main" worker, and the "h2" worker simply immediately accounts the whole file size as returned by stat() in the Apache logs? Regardless of how much data is actually read() from the file by the "main" worker and sent to the actual HTTP client?

famzah avatar Aug 13 '20 11:08 famzah

That is my explanation of the issue. The h1 worker 'writes' the whole file and all of the file to the h2 main connection processing and the logging module records it.

icing avatar Aug 13 '20 11:08 icing

You can test this by setting:

H2CopyFiles on

and the number should more closely reflect that actual data sent. However you will need more H2Worker threads and they stay allocated until a transfer is almost done.

icing avatar Aug 13 '20 12:08 icing

Hi @icing , @famzah I have tried

H2CopyFiles on

and indeed it changed the numbers, but only for cleartext (h2c) protocol they are close to the expected values (I enabled it for the tests). For h2 protocol (http/2+tls) every interrupted download was logged with 32768 for bytes sent, regardless how much data client has downloaded. For example client downloads about 200MB and in the log we have this:

s801.sureserver.com XXX.XXX.XXX.XXX - - [18/Aug/2020:03:11:28 -0400] "GET /testXXX.bin HTTP/2.0" 200 32768 "-" "curl/7.52.1" 0 0 "on:TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384" 52 9981569

vl-grigorov avatar Aug 19 '20 13:08 vl-grigorov

I see. I made a change in how/when the bytes are reported. Can you build and test from the master branch here or would you prefer a release?

icing avatar Aug 20 '20 07:08 icing

hello @icing I managed to build a module from master branch (with apache 2.4.46), but unfortunately the problem is not solved. Now I found 0 for bytes sent in the log. Here is the what I've got logged for 2 https requests (first with "H2CopyFiles off" and second one with "H2CopyFiles on"):

www.XXX.com XXX.XXX.XXX.XXX - - [20/Aug/2020:06:41:38 -0400] "GET /testXXX.bin HTTP/2.0" 200 0 "-" "curl/7.52.1" 0 0 "on:TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384" 56 2977041
www.XXX.com XXX.XXX.XXX.XXX - - [20/Aug/2020:06:47:37 -0400] "GET /testXXX.bin HTTP/2.0" 200 0 "-" "curl/7.52.1" 0 0 "on:TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384" 56 2980883

I can test new patches if available.

vl-grigorov avatar Aug 20 '20 10:08 vl-grigorov

Hey, @vl-grigorov, thanks for testing this. It seems the reporting was done too late and the access log entry had already been written.

I made another approach in master which shows up the lengths in my log. It should work the same for both H2CopyFiles on/off. I hope this gives a good enough approximation of the actual data sent.

icing avatar Aug 20 '20 11:08 icing

Now I see 0 for both partial and complete download. :( Will try to figure out what's wrong after hour or two. Your tests are with %O (Bytes sent), right?

vl-grigorov avatar Aug 20 '20 12:08 vl-grigorov

Sorry, about that. I think I need a better time slot on my side where I can test this more thoroughly. My test setup uses:

    LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" %k" combined
    CustomLog "logs/access_log" combined

icing avatar Aug 20 '20 12:08 icing

Hi @icing Sorry for the delay. I finaly managed to spend some time in testing but with no success. Still same result: 0 for both partial and full download. I tried to replace %О with %B and in this case I see file size for both partial and full download.

vl-grigorov avatar Aug 30 '20 20:08 vl-grigorov

Just to clarify: when you see the %B sizes, are they approximately what you'd exepct?

Ah, I see in the code that %B and %O are totally different counters and that %B will most likely count the whole file length. So, still to determine why %O does not update in your setup.

icing avatar Sep 01 '20 07:09 icing

Since this requires extensive testing, I suggest that we work with the same configuration and scripts.

Here is what I've configured in Apache (a separate "access_log", so that we don't disturb any production data):

Protocols h2 h2c http/1.1
#H2CopyFiles on

LogFormat "\"%r\" http_status=%>s(%s) bytes_resp_B=%B(%b) bytes_tx_O=%O bytes_rx_I=%I bytes_rx_tx_S=%S time_req_us=%D \"%{HTTPS}x:%{SSL_PROTOCOL}x\"" mod_h2_issue_203
CustomLog /var/log/access.mod_h2_issue_203 mod_h2_issue_203

Here is the script that I use to do the tests: http2_test.txt

Sample usage:

http2_test example.com/100M.bin

The script downloads about 20 MBytes for each request. I run it locally on the testing server to avoid network jitter. The script is designed not to leak any sensitive data like machine hostname, etc.

famzah avatar Sep 07 '20 14:09 famzah

Here are my results with the latest mod_h2 build with Apache/2.4.46:

"GET /100M.bin?test_id=1599490650&_http2_whole_ssl___ HTTP/2.0" http_status=200(200) bytes_resp_B=104857600(104857600) bytes_tx_O=0 bytes_rx_I=79 bytes_rx_tx_S=79 time_req_us=5078892 "on:TLSv1.2"
"GET /100M.bin?test_id=1599490650&http11_whole_ssl___ HTTP/1.1" http_status=200(200) bytes_resp_B=104857600(104857600) bytes_tx_O=31027282 bytes_rx_I=802 bytes_rx_tx_S=31028084 time_req_us=5071547 "on:TLSv1.2"
"GET /100M.bin?test_id=1599490650&http10_whole_ssl___ HTTP/1.0" http_status=200(200) bytes_resp_B=104857600(104857600) bytes_tx_O=31023909 bytes_rx_I=802 bytes_rx_tx_S=31024711 time_req_us=5068077 "on:TLSv1.2"

"GET /100M.bin?test_id=1599490650&_http2_whole_nonssl HTTP/1.1" http_status=101(101) bytes_resp_B=0(-) bytes_tx_O=34452707 bytes_rx_I=306 bytes_rx_tx_S=34453013 time_req_us=5055255 "off:-"
#
"GET /100M.bin?test_id=1599490650&_http2_whole_nonssl HTTP/2.0" http_status=200(200) bytes_resp_B=104857600(104857600) bytes_tx_O=0 bytes_rx_I=0 bytes_rx_tx_S=0 time_req_us=1599490671306927 "off:-"
"GET /100M.bin?test_id=1599490650&http11_whole_nonssl HTTP/1.1" http_status=200(200) bytes_resp_B=104857600(104857600) bytes_tx_O=29947619 bytes_rx_I=154 bytes_rx_tx_S=29947773 time_req_us=5069227 "off:-"
"GET /100M.bin?test_id=1599490650&http10_whole_nonssl HTTP/1.0" http_status=200(200) bytes_resp_B=104857600(104857600) bytes_tx_O=30929864 bytes_rx_I=154 bytes_rx_tx_S=30930018 time_req_us=5053866 "off:-"

"GET /100M.bin?test_id=1599490650&_http2_half_ssl____ HTTP/2.0" http_status=206(206) bytes_resp_B=52428801(52428801) bytes_tx_O=0 bytes_rx_I=93 bytes_rx_tx_S=93 time_req_us=5075063 "on:TLSv1.2"
"GET /100M.bin?test_id=1599490650&http11_half_ssl____ HTTP/1.1" http_status=206(206) bytes_resp_B=52428801(52428801) bytes_tx_O=31023957 bytes_rx_I=827 bytes_rx_tx_S=31024784 time_req_us=5067253 "on:TLSv1.2"

For all requests "time_req_us" indicates that the request indeed finishes immediately when the HTTP client disconnects.

For HTTP/1.1 and HTTP/1.0 via both SSL and non-SSL we have the following:

  • ok: %O and %S (bytes actually sent as response)
  • ok: %B (intended response size)

For HTTP/2 via both SSL and non-SSL we have the following:

  • bad: %O and %S (bytes actually sent as response); %O (bytes sent) always shows "0 bytes" and %S accounts only the rx bytes %I (as expected)
  • ok: %B (intended response size)

The access log for the "HTTP/2 via non-SSL" has two entries and maybe shows some insight, because the HTTP/1.1 logs correctly %O:

"GET /100M.bin?test_id=1599490650&_http2_whole_nonssl HTTP/1.1" http_status=101(101) bytes_resp_B=0(-) bytes_tx_O=34452707 bytes_rx_I=306 bytes_rx_tx_S=34453013 time_req_us=5055255 "off:-"
"GET /100M.bin?test_id=1599490650&_http2_whole_nonssl HTTP/2.0" http_status=200(200) bytes_resp_B=104857600(104857600) bytes_tx_O=0 bytes_rx_I=0 bytes_rx_tx_S=0 time_req_us=1599490671306927 "off:-"

FTR, because nobody uses "h2c" (HTTP/2 via non-SSL), I can skip those tests if they bring no value. Just say so.

famzah avatar Sep 07 '20 15:09 famzah

Tried to reproduce your settings. In your LogFormat I download the first 20001 bytes of a 1m resource and am seeing the difference in numbers:

"GET /files/data-1m HTTP/2.0" http_status=206(206) bytes_resp_B=20001(20001) bytes_tx_O=0 bytes_rx_I=73 bytes_rx_tx_S=73 time_req_us=3055 "on:TLSv1.2"
"GET /files/data-1m HTTP/1.1" http_status=206(206) bytes_resp_B=20001(20001) bytes_tx_O=22017 bytes_rx_I=495 bytes_rx_tx_S=22512 time_req_us=342 "on:TLSv1.2"

Making a test case for this and then find the fault...

icing avatar Sep 09 '20 13:09 icing

Hello. I've finally tried this in production and it works only if I enable H2CopyFiles.

If H2CopyFiles is off (default), then partial downloads are logged in as 0 bytes.

Full downloads are properly logged regardless of the H2CopyFiles setting.

Did you test with H2CopyFiles "on" or "off" (default)?

famzah avatar Dec 02 '20 18:12 famzah

Hmm, I believe that was without H2CopyFiles. But I will check this again.

icing avatar Dec 17 '20 16:12 icing