Java flame graph is not complete
Describe the bug
I deployed Pixie (self-hosted) and found that the flame graph is not complete.
To Reproduce
Steps to reproduce the behavior:
- Deploy Pixie on a Kubernetes cluster with the following environment:
os: Ubuntu 22.04.3 LTS
kernel: 6.2.0-1015-aws
vizier-pem: gcr.io/pixie-oss/pixie-prod/vizier-pem_image:0.14.9
java version:
openjdk version "1.8.0-builds.shipilev.net-openjdk-shenandoah-jdk8-b798-20210222"
OpenJDK Runtime Environment (build 1.8.0-builds.shipilev.net-openjdk-shenandoah-jdk8-b798-20210222-testing-b00)
OpenJDK 64-Bit Server VM (build 25.71-b00, mixed mode)
- Write the following Java code:
public class HelloWorld {
    public static void main(String[] args) {
        while (true) {
            int i = 0;
            while (i < 1000000) {
                f1();
                i++;
            }
            try {
                // pause 5 ms
                Thread.sleep(5);
            } catch (InterruptedException e) {
                System.out.println("Sleep was interrupted.");
            }
        }
    }

    public static void f1() {
        try {
            // pause 1 ms
            Thread.sleep(1);
        } catch (InterruptedException e) {
            System.out.println("Sleep was interrupted.");
        }
        f2();
    }

    public static void f2() {
        try {
            // pause 1 ms
            Thread.sleep(1);
        } catch (InterruptedException e) {
            System.out.println("Sleep was interrupted.");
        }
        f3();
    }

    public static void f3() {
        try {
            // pause 1 ms
            Thread.sleep(1);
        } catch (InterruptedException e) {
            System.out.println("Sleep was interrupted.");
        }
    }
}
- Set the entry point of the container image and run the program in the pod: java -XX:+PreserveFramePointer HelloWorld (see the commands sketched after this list)
- Select the pod and look at its flame graph, but I cannot see the f1/f2/f3 symbols
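For reference, the compile-and-run commands look roughly like this (a sketch; the class name matches the code above, but how the entry point is wired into the image is an assumption):
# compile the class inside the image
javac HelloWorld.java
# run with frame pointers preserved so the profiler can walk the stack
java -XX:+PreserveFramePointer HelloWorld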
App information (please complete the following information):
- Pixie version: gcr.io/pixie-oss/pixie-prod/vizier-pem_image:0.14.9
- K8s version: 1.24
- Node Kernel version: 6.2.0-1015-aws
- Browser version: Chrome 131.0.6778.266
I find that only when I compile with javac -g HelloWorld1.java can I see the function f1(), with both of the following commands. So I think PreserveFramePointer has no effect:
java -XX:+PreserveFramePointer HelloWorld1
java HelloWorld1
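As a side check (not part of the original report), javap can show whether the extra debug tables from -g actually made it into the class file; this is only a sketch, assuming HelloWorld1.class is in the current directory:
# without -g, javac still emits LineNumberTable entries by default
# with -g, the output additionally contains a LocalVariableTable per method
javap -l HelloWorld1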
Hi @datavisorhenryzhao, thanks for the bug report.
Can you collect the PEM logs via px collect-logs while that workload is running? It would also be helpful (but not required) if you could add the following CLI flag to the PEM DaemonSet before collecting the logs: --vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1
I will try to reproduce the issue on my own sometime this week as well, but having the logs would be very helpful.
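For reference, the log collection itself is a single command (assuming the px CLI is installed and authenticated against the cluster):
# run while the Java workload is active; this should produce a pixie_logs_<timestamp> zip in the working directory
px collect-logs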
1. When compiled without '-g'
2. When compiled with '-g', the f1 symbol can be found, but f2/f3 are not found
- Info for the PEM pod (pod henry runs on the same node)
By the way, how do I add the flag to the PEM DaemonSet? Do you know the full args? --vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1
I also tried the Parca project (eBPF based); it shows the full Java stack (f1/f2/f3), but only when compiled with '-g'.
@ddelnano hi, can you help me look into this? It is weird, because Parca works well compared with Pixie.
@datavisorhenryzhao sorry for the slow response. I have time set aside to investigate this tomorrow and reproduce it myself.
In order to enable that verbose logging, you need to add the following to the PEM DaemonSet's container block (example location). Note: no additional args need to be provided.
args:
  - --vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1
I recommend using kubectl -n pl edit daemonset/vizier-pem to edit it interactively. If you have the operator running, you will need to remove its namespace first, or it might redeploy the DaemonSet and undo the manifest change.
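A quick way to confirm the flag actually landed after the edit (a sketch, assuming the default pl namespace):
# verify the DaemonSet spec now carries the flag
kubectl -n pl get daemonset vizier-pem -o yaml | grep -- --vmodule
# confirm the PEM pods were rolled to pick up the new args
kubectl -n pl get pods | grep vizier-pem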
As for the logs you've collected already, I noticed that there are lots of lost stack frames. See this log from your zip file:
I20250130 10:25:54.157191 2541577 perf_profile_connector.cc:433] PerfProfileConnector statistics: kBPFMapSwitchoverEvent=21 kCumulativeSumOfAllStackTraces=464498 **kLossHistoEvent=9554**
For every kLossHistoEvent recorded, that means stack trace data was lost. This happens when Pixie isn't able to process the perf buffer used by our BPF program quickly enough. This likely means that Pixie is under provisioned for your given instance type size and the features you have enabled.
$ grep -r kLossHistoEvent ~/Downloads/pixie_logs_20250130110549 | grep -v 'kLossHistoEvent=0' | cut -d: -f1 | uniq -c
5 /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-v24kb__pem.log
4 /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-6dfzq__pem.log
5 /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-gw4zk__pem.log
$ grep CPUs /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-v24kb__pem.log
I20250130 10:14:07.961859 2771161 system_info.cc:44] Number of CPUs: 8
$ grep CPUs /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-6dfzq__pem.log
I20250130 10:14:31.381803 2538836 system_info.cc:44] Number of CPUs: 8
$ grep CPUs /Users/ciserver/Downloads/pixie_logs_20250130110549/pl__vizier-pem-gw4zk__pem.log
I20250130 10:14:08.298007 1112290 system_info.cc:44] Number of CPUs: 8
Since your cluster has a variety of instance types and the kLossHistoEvents are happening on the instances with 8 CPUs, it's possible these instances are under provisioned for Pixie's socket tracer and perf profiler.
In summary, I would try both of these things below:
- Pin the Java workload to one of the nodes that isn't experiencing the kLossHistoEvents. This will allow us to rule out the lost data as the source of the invalid flame graph.
- Your Vizier has PL_TABLE_STORE_DATA_LIMIT_MB: 2000 specified. I would remove this and/or increase the PEM's memory limit until the kLossHistoEvents are not present (see the commands sketched below).
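If it helps, the current settings can be inspected like this (a sketch; it assumes PL_TABLE_STORE_DATA_LIMIT_MB is surfaced as an env var on the PEM container):
# show the PEM container env, including PL_TABLE_STORE_DATA_LIMIT_MB if it is set there
kubectl -n pl get daemonset vizier-pem -o jsonpath='{.spec.template.spec.containers[0].env}'
# show the PEM container resource requests/limits
kubectl -n pl get daemonset vizier-pem -o jsonpath='{.spec.template.spec.containers[0].resources}'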
If we determine that the kLossHistoEvents are the cause, we can discuss how to tune the PEMs memory usage in more detail.
hi, @ddelnano
1. The PEM pod restarted with these args:
args:
  - --vmodule=px_jattach=1,perf_profile_connector=1,stringifier=1,elf_symbolizer=1
2. The pod henry and the PEM pod (vizier-pem-7gzw5) run on the same node.
- Checking the pod (vizier-pem-7gzw5) logs over about 6 minutes, I did not find the kLossHistoEvent=0
- Still missing the f1/f2/f3 stacks, whether or not javac -g is used
@datavisorhenryzhao did you capture the logs with px collect-logs again with that verbose logging enabled?
I did not find the kLossHistoEvent=0
Did you mean kLossHistoEvent=0? That would actually be a good sign as it would indicate no data loss. The case I pointed out showed significant kLossHistoEvents.
I'm able to reproduce the issue with the following Dockerfile and pod manifest:
Dockerfile and manifest
# Dockerfile
FROM debian:bookworm-slim AS build
# Install dependencies
RUN apt-get update && apt-get install -y \
curl \
xz-utils \
&& rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Download and extract the specified OpenJDK version
RUN curl -L -o openjdk.tar.xz https://builds.shipilev.net/openjdk-jdk8/openjdk-jdk8-linux-x86_64-server.tar.xz \
&& mkdir -p /opt/openjdk \
&& tar -xJf openjdk.tar.xz -C /opt/openjdk --strip-components=1 \
&& rm openjdk.tar.xz
# Set JAVA_HOME and update PATH
ENV JAVA_HOME=/opt/openjdk
ENV PATH="$JAVA_HOME/bin:$PATH"
# Copy application source code
COPY HelloWorld.java .
# Compile Java application
RUN javac HelloWorld.java
# Runtime image
FROM debian:bookworm-slim
# Install dependencies
RUN apt-get update && apt-get install -y \
curl \
&& rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Copy the compiled Java application and JDK from build stage
COPY --from=build /opt/openjdk /opt/openjdk
COPY --from=build /app/HelloWorld.class .
# Set JAVA_HOME and update PATH
ENV JAVA_HOME=/opt/openjdk
ENV PATH="$JAVA_HOME/bin:$PATH"
# Command to run the application
CMD ["sh", "-c", "java -version && java -XX:+PreserveFramePointer HelloWorld"]
# k8s manifest
apiVersion: v1
kind: Pod
metadata:
  name: java-helloworld
  labels:
    app: java-helloworld
spec:
  containers:
  - args: ["sh", "-c", "java -version && java -XX:+PreserveFramePointer HelloWorld"]
    name: java-helloworld
    image: ddelnano/java-helloworld:latest
    imagePullPolicy: Always
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "250m"
        memory: "256Mi"
    ports:
    - containerPort: 8080
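For reference, a rough build-and-deploy sequence for the above (a sketch; the manifest filename java-helloworld.yaml is assumed):
docker build -t ddelnano/java-helloworld:latest .
docker push ddelnano/java-helloworld:latest
kubectl apply -f java-helloworld.yaml
# the pod should log the java -version banner before entering the busy loop
kubectl logs pod/java-helloworld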
I've collected logs and don't see anything immediately obvious, but I'm continuing to investigate.
Thanks for your effort