Java flame graph is not complete
Describe the bug
I deployed Pixie (self-hosted) and found that the flame graph is not complete.
To Reproduce
Steps to reproduce the behavior:
- Deploy Pixie on a Kubernetes cluster with the following environment:
os: Ubuntu 22.04.3 LTS
kernel: 6.2.0-1015-aws
vizier-pem: gcr.io/pixie-oss/pixie-prod/vizier-pem_image:0.14.9
java version:
openjdk version "1.8.0-builds.shipilev.net-openjdk-shenandoah-jdk8-b798-20210222"
OpenJDK Runtime Environment (build 1.8.0-builds.shipilev.net-openjdk-shenandoah-jdk8-b798-20210222-testing-b00)
OpenJDK 64-Bit Server VM (build 25.71-b00, mixed mode)
- Write the following Java code:
public class HelloWorld {
    public static void main(String[] args) {
        while (true) {
            int i = 0;
            while (i < 1000000) {
                f1();
                i++;
            }
            try {
                // pause 5 ms
                Thread.sleep(5);
            } catch (InterruptedException e) {
                System.out.println("Sleep was interrupted.");
            }
        }
    }

    public static void f1() {
        try {
            // pause 1 ms
            Thread.sleep(1);
        } catch (InterruptedException e) {
            System.out.println("Sleep was interrupted.");
        }
        f2();
    }

    public static void f2() {
        try {
            // pause 1 ms
            Thread.sleep(1);
        } catch (InterruptedException e) {
            System.out.println("Sleep was interrupted.");
        }
        f3();
    }

    public static void f3() {
        try {
            // pause 1 ms
            Thread.sleep(1);
        } catch (InterruptedException e) {
            System.out.println("Sleep was interrupted.");
        }
    }
}
- Set the entry point of the container image and run the program in the pod: java -XX:+PreserveFramePointer HelloWorld (see the commands sketched after this list)
- Select the pod and look at its flame graph, but I cannot see the f1/f2/f3 symbols
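For reference, the compile-and-run commands look roughly like this (a sketch; the class name matches the code above, but how the entry point is wired into the image is an assumption):
# compile the class inside the image
javac HelloWorld.java
# run with frame pointers preserved so the profiler can walk the stack
java -XX:+PreserveFramePointer HelloWorld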
App information (please complete the following information):
- Pixie version: gcr.io/pixie-oss/pixie-prod/vizier-pem_image:0.14.9
- K8s version: 1.24
- Node Kernel version: 6.2.0-1015-aws
- Browser version: Chrome 131.0.6778.266
I find that only when I compile with javac -g HelloWorld1.java can I see the function f1(), with both of the following commands. So I think PreserveFramePointer has no effect:
java -XX:+PreserveFramePointer HelloWorld1
java HelloWorld1
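As a side check (not part of the original report), javap can show whether the extra debug tables from -g actually made it into the class file; this is only a sketch, assuming HelloWorld1.class is in the current directory:
# without -g, javac still emits LineNumberTable entries by default
# with -g, the output additionally contains a LocalVariableTable per method
javap -l HelloWorld1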
Hi @datavisorhenryzhao, thanks for the bug report.
Can you collect the PEM logs via px collect-logs while that workload is running? It would also be helpful (but not required) if you could add the following CLI flag to the PEM DaemonSet before collecting the logs: --vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1
I will try to reproduce the issue on my own sometime this week as well, but having the logs would be very helpful.
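For reference, the log collection itself is a single command (assuming the px CLI is installed and authenticated against the cluster):
# run while the Java workload is active; this should produce a pixie_logs_<timestamp> zip in the working directory
px collect-logs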
1. When compiled without '-g'
2. When compiled with '-g', the f1 symbol can be found, but f2/f3 are not found
- Info for the PEM pod (pod henry runs on the same node)
By the way, how do I add the flag to the PEM DaemonSet? Do you know the full args? --vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1
I also tried the Parca project (eBPF based); it shows the full Java stack (f1/f2/f3), but only when compiled with '-g'.
@ddelnano hi, can you help me look into this? It is weird, because Parca works well compared with Pixie.
@datavisorhenryzhao sorry for the slow response. I have time set aside to investigate this tomorrow and reproduce it myself.
In order to enable that verbose logging, you need to add the following to the PEM DaemonSet's container block (example location). Note: no additional args need to be provided.
args:
  - --vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1
I recommend using kubectl -n pl edit daemonset/vizier-pem to edit it interactively. If you have the operator running, you will need to remove its namespace first, or it might redeploy the DaemonSet and undo the manifest change.
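A quick way to confirm the flag actually landed after the edit (a sketch, assuming the default pl namespace):
# verify the DaemonSet spec now carries the flag
kubectl -n pl get daemonset vizier-pem -o yaml | grep -- --vmodule
# confirm the PEM pods were rolled to pick up the new args
kubectl -n pl get pods | grep vizier-pem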
As for the logs you've collected already, I noticed that there are lots of lost stack frames. See this log from your zip file:
I20250130 10:25:54.157191 2541577 perf_profile_connector.cc:433] PerfProfileConnector statistics: kBPFMapSwitchoverEvent=21 kCumulativeSumOfAllStackTraces=464498 **kLossHistoEvent=9554**
For every kLossHistoEvent recorded, that means stack trace data was lost. This happens when Pixie isn't able to process the perf buffer used by our BPF program quickly enough. This likely means that Pixie is under provisioned for your given instance type size and the features you have enabled.
$ grep -r kLossHistoEvent ~/Downloads/pixie_logs_20250130110549 | grep -v 'kLossHistoEvent=0' | cut -d: -f1 | uniq -c
5 /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-v24kb__pem.log
4 /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-6dfzq__pem.log
5 /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-gw4zk__pem.log
$ grep CPUs /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-v24kb__pem.log
I20250130 10:14:07.961859 2771161 system_info.cc:44] Number of CPUs: 8
$ grep CPUs /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-6dfzq__pem.log
I20250130 10:14:31.381803 2538836 system_info.cc:44] Number of CPUs: 8
$ grep CPUs /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-gw4zk__pem.log
I20250130 10:14:08.298007 1112290 system_info.cc:44] Number of CPUs: 8
Since your cluster has a variety of instance types and the kLossHistoEvents are happening on the instances with 8 CPUs, it's possible these instances are under provisioned for Pixie's socket tracer and perf profiler.
In summary, I would try both of these things below:
- Pin the Java workload to one of the nodes that isn't experiencing the kLossHistoEvents. This will allow us to rule out the lost data as the source of the invalid flame graph.
- Your Vizier has PL_TABLE_STORE_DATA_LIMIT_MB: 2000 specified. I would remove this and/or increase the PEM's memory limit until the kLossHistoEvents are not present (see the commands sketched below).
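If it helps, the current settings can be inspected like this (a sketch; it assumes PL_TABLE_STORE_DATA_LIMIT_MB is surfaced as an env var on the PEM container):
# show the PEM container env, including PL_TABLE_STORE_DATA_LIMIT_MB if it is set there
kubectl -n pl get daemonset vizier-pem -o jsonpath='{.spec.template.spec.containers[0].env}'
# show the PEM container resource requests/limits
kubectl -n pl get daemonset vizier-pem -o jsonpath='{.spec.template.spec.containers[0].resources}'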
If we determine that the kLossHistoEvents are the cause, we can discuss how to tune the PEMs memory usage in more detail.
hi, @ddelnano
1. The PEM pod restarted with these args:
args:
  - --vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1
2. The pod henry and the PEM pod (vizier-pem-7gzw5) run on the same node.
- Checking the pod (vizier-pem-7gzw5) logs over about 6 minutes, I did not find the kLossHistoEvent=0
- Still missing the f1/f2/f3 stacks, whether or not javac -g is used
@datavisorhenryzhao did you capture the logs with px collect-logs again with that verbose logging enabled?
I did not find the kLossHistoEvent=0
Did you mean kLossHistoEvent=0? That would actually be a good sign as it would indicate no data loss. The case I pointed out showed significant kLossHistoEvents.
I'm able to reproduce the issue with the following Dockerfile and pod manifest:
Dockerfile and manifest
# Dockerfile
FROM debian:bookworm-slim AS build
# Install dependencies
RUN apt-get update && apt-get install -y \
curl \
xz-utils \
&& rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Download and extract the specified OpenJDK version
RUN curl -L -o openjdk.tar.xz https://builds.shipilev.net/openjdk-jdk8/openjdk-jdk8-linux-x86_64-server.tar.xz \
&& mkdir -p /opt/openjdk \
&& tar -xJf openjdk.tar.xz -C /opt/openjdk --strip-components=1 \
&& rm openjdk.tar.xz
# Set JAVA_HOME and update PATH
ENV JAVA_HOME=/opt/openjdk
ENV PATH="$JAVA_HOME/bin:$PATH"
# Copy application source code
COPY HelloWorld.java .
# Compile Java application
RUN javac HelloWorld.java
# Runtime image
FROM debian:bookworm-slim
# Install dependencies
RUN apt-get update && apt-get install -y \
curl \
&& rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Copy the compiled Java application and JDK from build stage
COPY --from=build /opt/openjdk /opt/openjdk
COPY --from=build /app/HelloWorld.class .
# Set JAVA_HOME and update PATH
ENV JAVA_HOME=/opt/openjdk
ENV PATH="$JAVA_HOME/bin:$PATH"
# Command to run the application
CMD ["sh", "-c", "java -version && java -XX:+PreserveFramePointer HelloWorld"]
# k8s manifest
apiVersion: v1
kind: Pod
metadata:
  name: java-helloworld
  labels:
    app: java-helloworld
spec:
  containers:
  - args: ["sh", "-c", "java -version && java -XX:+PreserveFramePointer HelloWorld"]
    name: java-helloworld
    image: ddelnano/java-helloworld:latest
    imagePullPolicy: Always
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "250m"
        memory: "256Mi"
    ports:
    - containerPort: 8080
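For reference, a rough build-and-deploy sequence for the above (a sketch; the manifest filename java-helloworld.yaml is assumed):
docker build -t ddelnano/java-helloworld:latest .
docker push ddelnano/java-helloworld:latest
kubectl apply -f java-helloworld.yaml
# the pod should log the java -version banner before entering the busy loop
kubectl logs pod/java-helloworld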
I've collected logs and don't see anything immediately obvious, but I'm continuing to investigate.
Thanks for your effort