cli icon indicating copy to clipboard operation
cli copied to clipboard

Randomly missing staging log lines after migration from v6 to v7

Open krismarc opened this issue 4 years ago • 5 comments

Dear Community,

after migration from v6 to v7 client, we are noticing missing lines at the staging part while doing push command. I had impression this could be related to the buildpack itself. However, I don't think that downloading the buildpack or the droplet uploading log lines are coming from it. You may find 'vimdiff' comparison at the bottom of this issue.

This could be somehow related to following issues: https://github.com/cloudfoundry/cli/issues/2213 https://github.com/cloudfoundry/cli/issues/2211

Current version:

# cf version
cf version 7.2.0+be4a5ce2b.2020-12-10

Missing lines:

23.24 MiB / 23.24 MiB  100.00% 2s

Waiting for API to complete processing files...

Staging app and tracing logs...
   Downloading liberty_buildpack_uat...
   Cell d8605384-990d-457d-b497-7bd84de6545a creating container for instance b40f7310-bb74-494f-a4b9-f239bb1d11e2
   Downloading app package...
   -----> Downloading IBM 1.8.0_sr6fp3001 JRE from http://wls-java-8-0-6-30-01.***.com/linux/sr-ibm-java-jre-8.0.6.30_01-linux.tar.gz ... (2.0s)
            Expanding JRE to .java ... (1.9s)
   -----> Liberty version selected: 21.0.0_0201 
   Exit status 0
   Uploading droplet...
   Uploaded build artifacts cache (331.1M)

Waiting for app main-vNext to start...

Correct lines using the same client and pushing the same app:

23.24 MiB / 23.24 MiB  100.00% 2s

Waiting for API to complete processing files...

Staging app and tracing logs...
   Downloading liberty_buildpack_uat...
   Downloaded liberty_buildpack_uat (273.8K)
   Cell e40de807-7e6c-4236-a5b0-31a209289824 creating container for instance 75ce2a55-1102-4db4-b8ea-0a01d4252d54
   Cell e40de807-7e6c-4236-a5b0-31a209289824 successfully created container for instance 75ce2a55-1102-4db4-b8ea-0a01d4252d54
   Downloading app package...
   Downloaded app package (529.3M)
   -----> Liberty Buildpack Version: v3.20-sr
   -----> Java version selected: 1.8.0_sr6fp3001 
   -----> Downloading IBM 1.8.0_sr6fp3001 JRE from http://wls-java-8-0-6-30-01.***.com/linux/sr-ibm-java-jre-8.0.6.30_01-linux.tar.gz ... (1.6s)
            Expanding JRE to .java ... (2.2s)
   -----> Liberty version selected: 21.0.0_0201 
   -----> Downloading 21.0.0_0201 from http://wls-21-0-0-2_01.***.com/sr-wlp-base-javaee-all-21.0.0.2_01.zip ... (3.4s)
            Installing archive ... (3.6s)
   -----> Liberty buildpack is done creating the droplet
   Exit status 0
   Uploading droplet, build artifacts cache...
   Uploading build artifacts cache...
   Uploading droplet...
   Uploaded build artifacts cache (331.1M)
   Uploaded droplet (850.1M)
   Uploading complete
   Cell e40de807-7e6c-4236-a5b0-31a209289824 stopping instance 75ce2a55-1102-4db4-b8ea-0a01d4252d54
   Cell e40de807-7e6c-4236-a5b0-31a209289824 destroying container for instance 75ce2a55-1102-4db4-b8ea-0a01d4252d54

Waiting for app main-vNext to start...

image

Best regards, K.M.

krismarc avatar Nov 05 '21 09:11 krismarc

The same is observed using 7.3 client. I've tried to debug it a bit in order to find out what's going on here.

Seems like in 'BAD' example (while part of logs are missing). The start_time value for log-cache API call never gets changed. It uses the same value until the process is over:

GET /api/v1/read/3477ccad-60fc-4814-b950-59a07ebdc599?envelope_types=LOG&start_time=1636459763992169697

In 'GOOD' example the client called log-cache with start_time=1636460013053292979 to get "Uploaded build artifacts cache" and start_time=1636460030794872068 to get "Uploaded droplet (388.1M)"

Both are successful. However logs are not fully retrieved from the log-cache.

BAD:

REQUEST: [2021-11-09T13:09:26+01:00]
GET /api/v1/read/3477ccad-60fc-4814-b950-59a07ebdc599?envelope_types=LOG&start_time=1636459763992169697 HTTP/1.1
Host: log-cache.appcloudtest.***.com
Authorization: [PRIVATE DATA HIDDEN]/Waiting
User-Agent: cf/7.3.0+645c3ce6a.2021-08-16 (go1.16.6; amd64 linux)


   Uploaded build artifacts cache (335.2M)
REQUEST: [2021-11-09T13:09:40+01:00]
GET /api/v1/read/3477ccad-60fc-4814-b950-59a07ebdc599?envelope_types=LOG&start_time=1636459763992169697 HTTP/1.1
Host: log-cache.appcloudtest.***.com
Authorization: [PRIVATE DATA HIDDEN]
User-Agent: cf/7.3.0+645c3ce6a.2021-08-16 (go1.16.6; amd64 linux)


ERROR: [2021-11-09T13:09:40+01:00]
Get "https://log-cache.appcloudtest.***.com/api/v1/read/3477ccad-60fc-4814-b950-59a07ebdc599?envelope_types=LOG&start_time=1636459763992169697": context c
anceled

GOOD:

REQUEST: [2021-11-09T13:13:35+01:00]
GET /api/v1/read/3477ccad-60fc-4814-b950-59a07ebdc599?envelope_types=LOG&start_time=1636460013053292979 HTTP/1.1
Host: log-cache.appcloudtest.***.com
Authorization: [PRIVATE DATA HIDDEN]
User-Agent: cf/7.3.0+645c3ce6a.2021-08-16 (go1.16.6; amd64 linux)


   Uploaded build artifacts cache (335.2M)

REQUEST: [2021-11-09T13:13:52+01:00]
GET /api/v1/read/3477ccad-60fc-4814-b950-59a07ebdc599?envelope_types=LOG&start_time=1636460030794872068 HTTP/1.1
Host: log-cache.appcloudtest.***.com
Authorization: [PRIVATE DATA HIDDEN]
User-Agent: cf/7.3.0+645c3ce6a.2021-08-16 (go1.16.6; amd64 linux)


   Uploaded droplet (388.1M)

REQUEST: [2021-11-09T13:13:54+01:00]
GET /api/v1/read/3477ccad-60fc-4814-b950-59a07ebdc599?envelope_types=LOG&start_time=1636460032013027338 HTTP/1.1
Host: log-cache.appcloudtest.***.com
Authorization: [PRIVATE DATA HIDDEN]
User-Agent: cf/7.3.0+645c3ce6a.2021-08-16 (go1.16.6; amd64 linux)


ERROR: [2021-11-09T13:13:54+01:00]
Get "https://log-cache.appcloudtest.***.com/api/v1/read/3477ccad-60fc-4814-b950-59a07ebdc599?envelope_types=LOG&start_time=1636460032013027338": context c
anceled

krismarc avatar Nov 09 '21 12:11 krismarc

Hey there,

I just simulated reading the log-cache API data myself while doing cf push. I was able to retrieve more lines than the cf cli itself for the same app creation at the same time.

CF CLI: # cf push

Waiting for API to complete processing files...

Staging app and tracing logs...
   Downloading liberty_buildpack_uat...
   Downloaded liberty_buildpack_uat
   Cell 6b12aa8a-0beb-4760-a88d-cdbbcea645bc creating container for instance 56c5cf45-5a53-4dbe-b9c9-5f44a1e21c92
   Cell 6b12aa8a-0beb-4760-a88d-cdbbcea645bc successfully created container for instance 56c5cf45-5a53-4dbe-b9c9-5f44a1e21c92
   Downloading build artifacts cache...
   Downloading app package...
   Downloaded app package (55.1M)
   Downloaded build artifacts cache (335.2M)
   -----> Liberty Buildpack Version: v3.20-sr
   -----> Java version selected: 1.8.0_sr6fp3501
   -----> Downloading IBM 1.8.0_sr6fp3501 JRE from http://wls-java-8-0-6-35-01.appcloudtest-pub.***.com/linux/sr-ibm-java-jre-8.0.6.35_01-linux.tar.gz ... (0.0s)
            Expanding JRE to .java ... (1.9s)
   -----> Liberty version selected: 21.0.0_0901
   -----> Downloading 21.0.0_0901 from http://wls-21-0-0-9_01.appcloudtest-pub.***.com/sr-wlp-base-javaee-all-21.0.0.9_01.zip ... (0.0s)
            Installing archive ... (3.2s)
   -----> Liberty buildpack is done creating the droplet
   Exit status 0
   Uploading droplet, build artifacts cache...
   Uploading droplet...
   Uploading build artifacts cache...
   Uploaded build artifacts cache (335.2M)

Waiting for app main-vNext to start...

Instances starting...
Instances starting...
Instances starting...

my own log-cache reader output - based on it's API:

# ./read_log-cache.sh

Cell 6b12aa8a-0beb-4760-a88d-cdbbcea645bc successfully created container for instance 56c5cf45-5a53-4dbe-b9c9-5f44a1e21c92
Downloading build artifacts cache...
Downloading app package...
Downloaded app package (55.1M)
Downloaded build artifacts cache (335.2M)
-----> Liberty Buildpack Version: v3.20-sr
-----> Java version selected: 1.8.0_sr6fp3501
-----> Downloading IBM 1.8.0_sr6fp3501 JRE from http://wls-java-8-0-6-35-01.appcloudtest-pub.***.com/linux/sr-ibm-java-jre-8.0.6.35_01-linux.tar.gz ... (0.0s)
         Expanding JRE to .java ... (1.9s)
-----> Liberty version selected: 21.0.0_0901
-----> Downloading 21.0.0_0901 from http://wls-21-0-0-9_01.appcloudtest-pub.***.com/sr-wlp-base-javaee-all-21.0.0.9_01.zip ... (0.0s)
         Installing archive ... (3.2s)
-----> Liberty buildpack is done creating the droplet
Exit status 0
Uploading droplet, build artifacts cache...
Uploading droplet...
Uploading build artifacts cache...
Creating droplet for app with guid 3477ccad-60fc-4814-b950-59a07ebdc599
Uploaded build artifacts cache (335.2M)
Uploaded droplet (388.1M)
Uploading complete
Cell 6b12aa8a-0beb-4760-a88d-cdbbcea645bc stopping instance 56c5cf45-5a53-4dbe-b9c9-5f44a1e21c92
Cell 6b12aa8a-0beb-4760-a88d-cdbbcea645bc destroying container for instance 56c5cf45-5a53-4dbe-b9c9-5f44a1e21c92
Cell 6b12aa8a-0beb-4760-a88d-cdbbcea645bc successfully destroyed container for instance 56c5cf45-5a53-4dbe-b9c9-5f44a1e21c92
Updated app with guid 3477ccad-60fc-4814-b950-59a07ebdc599 ({:droplet_guid=>"bd33afa8-6037-4e4c-8f25-9117cb27796c"})
Creating revision for app with guid 3477ccad-60fc-4814-b950-59a07ebdc599
Cell f402e268-7f12-4e73-b70d-ab4e14827e74 stopping instance e21ee056-e8c1-4a40-714f-ec77
Restarted app with guid 3477ccad-60fc-4814-b950-59a07ebdc599
Cell 6b12aa8a-0beb-4760-a88d-cdbbcea645bc creating container for instance 00cad8ba-1fd8-45bd-750a-2a95
Downloading droplet...
[AUDIT   ] CWWKE1100I: Waiting for up to 30 seconds for the server to quiesce.
[INFO    ] CWWKO0220I: TCP Channel defaultHttpEndpoint has stopped listening for requests on host *  (IPv4) port 8080.
[INFO    ] CWWKO0220I: TCP Channel wasJmsEndpoint504 has stopped listening for requests on host *  (IPv4) port 7276.
[INFO    ] CWWKO0220I: TCP Channel wasJmsEndpoint504-ssl has stopped listening for requests on host *  (IPv4) port 7286.

it's simply curl based doing some manipulation with retrieved log payload (not prefect at all, but it's enough to show differences): # cat read_log-cache.sh

#!/bin/bash

export guid=$(cf app main-vNext --guid)
export token=$(cf oauth-token)
log_cache_url="https://log-cache.appcloudtest.***.com"

previous_list=()
while true;
do
  timestamp=$(( $(date -d '-2 seconds' +%s%N) ))
  payload=$(curl -H "Authorization: ${token}" "${log_cache_url}/api/v1/read/${guid}?envelope_types=LOG&start_time=${timestamp}" -s | jq -r .envelopes.batch[].log.payload )

  list=()
  for i in $payload; do
    list+=("$i")
  done
  if [[ ${#list[@]} == 0 ]]; then
    continue
  fi

  for el in ${list[@]}; do
    for ep in ${previous_list[@]}; do
        if  [[ ${ep} == ${el}  ]]; then
            #echo "already printed $ep "
            list=("${list[@]/$ep}")
        fi
    done
  done

  for log in ${list[@]}; do
    echo "$log" | base64 -d
    echo ""
  done

  previous_list=("${list[@]}")
  sleep 1.5
done

Log comparison using vimdiff (my reader on the left, cf cli on the right): compare

Best regards, K.M.

krismarc avatar Nov 09 '21 17:11 krismarc

Hello,

During migration form v6 to v7 CF CLI switched from loggregator to log-cache for log streaming. This new approach is more scalable but it has some known side effects tldr; There is no delivery guarantee for cf logs as a trade off to be able to stream from large number of instances efficiently. You're right in sense that you can adjust walking window and capture more logs from API, you can also look and fine tune delays in our log-cache client configuration If your case requires streaming all logs with delivery guarantee, please consider 3rd party tools to integrate with log-cache or to collect logs directly from instances.

CC @hjcalderon10 , @jdgonzaleza, @pivotalgeorge

a-b avatar Nov 24 '21 22:11 a-b

Hi @a-b

thank you for your response and pointing me out to this documentation!

I went through all comments in your tracker: https://www.pivotaltracker.com/n/projects/892938/stories/171759407/comments/212391238 ..and I did learn that 2s delay is not enough in our environment to 'settle' all logs in the log-cache therefore some lines are missing.

Increasing the WalkDelay to 4s logcache.WithWalkDelay(4*time.Second) is enough to get all the logs with small exception. The stream is being cancelled immediately once the staging phase is completed therefore the last logging iteration is not continued and very last lines produced within the time equal to WalkDelay are not present. Preventing this stream cancellation by adding some sleep solves it!

Hard example - just to proof that. I did set WalkDelay to 60s and I did get no logs from the client at all. The stream got cancelled because staging phase got completed before it came to print out envelopes retrieved from the log-cache. image

..by adding extra sleep upfront this stream cancellation with a value greater than actual WalkDelay I was able to retrieve everything:

        case v7pushaction.StartingStaging:
                cmd.UI.DisplayNewline()
                cmd.UI.DisplayText("Staging app and tracing logs...")
                logStream, errStream, cancelFunc, warnings, err := cmd.VersionActor.GetStreamingLogsForApplicationByNameAndSpace(appName, cmd.Config.TargetedSpace().GUID, cmd.LogCacheClient) 
// /\ the stream starts here /\
                cmd.UI.DisplayWarnings(warnings)
                if err != nil {
                        return err
                }
                if cmd.stopStreamingFunc != nil {
                        cmd.stopStreamingFunc()
                }
                cmd.stopStreamingFunc = cancelFunc
                go cmd.getLogs(logStream, errStream)
        case v7pushaction.StagingComplete:
                time.Sleep(120 * time.Second) // stream cancel prevention to make the last logging iteration possible
                if cmd.stopStreamingFunc != nil {
                        cmd.stopStreamingFunc() // the stream is being cancelled here
                        cmd.stopStreamingFunc = nil
                }

image

So, I would assume we got 2 things here.

  1. Is it anyhow possible to configure WalkDelay without building the client myself? Or are you willing to make it possible? Or even change the default value to higher than 2s?
  2. Shouldn't you consider adding smarter blocker for this stream cancellation to let the client print out all remaining staging lines? Somehow based on the value of WalkDelay

..with WalkDelay set to 4s without the stream cancel prevention. We basically miss just following lines:

   Uploaded droplet (46.4M)
   Uploading complete
   Cell 77d61383-eb71-4183-826c-959141b6add6 stopping instance 60705a22-61ca-47d8-abd4-1ac842211c35
   Cell 77d61383-eb71-4183-826c-959141b6add6 destroying container for instance 60705a22-61ca-47d8-abd4-1ac842211c35
   Cell 77d61383-eb71-4183-826c-959141b6add6 successfully destroyed container for instance 60705a22-61ca-47d8-abd4-1ac842211c35

image

Hope you will find time to respond a bit quicker :)

Best regards, K.M.

krismarc avatar Nov 25 '21 15:11 krismarc

Hi @a-b

Is there any updates on K.M's questions? I wonder if this can be solved within the CLI!

anyandrea avatar Dec 20 '21 12:12 anyandrea