UI crashing with inconsistent DRI device mapping on redroid:11.0.0-latest
When running redroid:11.0.0-latest on my new RX 6950 XT, the UI seems to be constantly crashing, giving only a flickering wallpaper in scrcpy.
I use rootful podman to run redroid:
sudo podman run -itd --name redroid_test --privileged -p 5555:5555 redroid/redroid:11.0.0-latest \
androidboot.redroid_width=1920 androidboot.redroid_height=1080 \
androidboot.use_memfd=1 \
androidboot.redroid_gpu_mode=host \
androidboot.redroid_gpu_node=/dev/dri/renderD128
My /dev/dri/renderD128 is mapped to the RX 6950 XT, and /dev/dri/renderD129 is mapped to the iGPU (Intel 9900K). After setting androidboot.redroid_gpu_node=/dev/dri/renderD129, everything runs fine.
EDIT: this assumption is wrong, see the comment below.
Logs were collected through curl -fsSL https://raw.githubusercontent.com/remote-android/redroid-doc/master/debug.sh | sed s/docker/podman/g | sudo bash -s -- redroid_test
image-inspect.txt network.txt dumpsys.txt container-inspect.txt getprop.txt vainfo.txt ps.txt podman-info.txt logcat.txt dri.txt dmesg.txt crash.txt uname.txt lscpu.txt getenforce.txt drivers.txt config-6.4.11-zen2-1-zen.txt
This looks like a mesa issue; maybe I can try to rebuild the redroid image when I have time to see if that works.
BTW, is it possible to just bind mount my system mesa into the container? What paths should I look for?
Something weird:
- from dumpsys.txt: GLES: AMD, AMD Radeon RX 6950 XT (navi21, LLVM 16.0.0, DRM 3.52, 6.4.11-zen2-1-zen), OpenGL ES 3.2 Mesa 23.0.1 (git-b590fd1951)
- from getprop.txt: [ro.hardware.vulkan]: [intel]
Were all these logs collected from the same container? (BTW, please provide the compressed tarball)
You cannot just bind mount mesa libs from the host (ABI incompatible); it is possible to build mesa3d / llvm libs and bind mount those into the redroid container.
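For what it's worth, a heavily hedged sketch of what such a bind mount could look like once you have Android-ABI (bionic) builds of mesa; the source path is hypothetical and the target path is only the conventional Android vendor EGL location, which may differ in the redroid image:
# HYPOTHETICAL: $HOME/android-mesa must contain mesa built against bionic
# (e.g. via the NDK), not the host glibc build; /vendor/lib64/egl is the
# usual vendor GLES directory in Android images.
sudo podman run -itd --name redroid_test --privileged -p 5555:5555 \
-v "$HOME/android-mesa/libGLES_mesa.so:/vendor/lib64/egl/libGLES_mesa.so:ro" \
redroid/redroid:11.0.0-latest androidboot.redroid_gpu_mode=host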
The logs are indeed from the same container, and after double-checking I found something weird in my system setup that might be relevant to this inconsistency.
The assumption about the DRI device mapping in my initial post is wrong: on my system /dev/dri/card0 and /dev/dri/renderD129 are mapped to the AMD dGPU, while /dev/dri/card1 and /dev/dri/renderD128 are mapped to the Intel iGPU:
~ sudo /bin/ls -l /dev/dri
total 0
drwxr-xr-x 2 root root 120 Aug 22 10:52 by-path
crw-rw----+ 1 root video 226, 0 Aug 22 19:34 card0
crw-rw----+ 1 root video 226, 1 Aug 22 10:52 card1
crw-rw-rw- 1 root render 226, 128 Aug 22 10:32 renderD128
crw-rw-rw- 1 root render 226, 129 Aug 22 10:52 renderD129
~ sudo bash -c 'for i in /sys/kernel/debug/dri/*/name; do echo -n "$i: "; /bin/cat $i; done;'
/sys/kernel/debug/dri/0/name: amdgpu dev=0000:03:00.0 unique=0000:03:00.0
/sys/kernel/debug/dri/128/name: i915 dev=0000:00:02.0 master=pci:0000:00:02.0 unique=0000:00:02.0
/sys/kernel/debug/dri/129/name: amdgpu dev=0000:03:00.0 unique=0000:03:00.0
/sys/kernel/debug/dri/1/name: i915 dev=0000:00:02.0 master=pci:0000:00:02.0 unique=0000:00:02.0
The order of the DRI devices is inverted between the card* and renderD* nodes. My theory is that if any part of redroid assumes they are in the same order, there could be an inconsistency there (maybe [ro.hardware.vulkan] reads from /dev/dri/renderD128 and gets Intel, while mesa GLES reads from /dev/dri/card0 and gets the AMD GPU).
How redroid still works with androidboot.redroid_gpu_node=/dev/dri/renderD129 (which, upon second check, actually selects the AMD GPU) under this mess is still a mystery to me.
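As a cross-check, the same mapping can be listed from sysfs without root or debugfs (a small sketch, assuming the usual /sys/class/drm layout):
# Each DRM node links back to the PCI device that owns it, so the kernel
# driver behind every card*/renderD* node can be printed directly:
for n in /sys/class/drm/card? /sys/class/drm/renderD*; do
  printf '%s: %s\n' "$(basename "$n")" "$(basename "$(readlink "$n/device/driver")")"
done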
Edited to reflect new findings
It seems that under my setup mesa GLES always selects the AMD GPU, while Vulkan selects a device according to androidboot.redroid_gpu_node.
With androidboot.redroid_gpu_node=/dev/dri/renderD129 (working config):
$ rg vulkan tmp/tmp.En9k2XamRr/getprop.txt
360:[ro.hardware.vulkan]: [radeon]
361:[ro.hwui.use_vulkan]: []
$ rg GLES tmp/tmp.En9k2XamRr/dumpsys.txt
582:GLES: AMD, AMD Radeon RX 6950 XT (navi21, LLVM 16.0.0, DRM 3.52, 6.4.11-zen2-1-zen), OpenGL ES 3.2 Mesa 23.0.1 (git-b590fd1951)
With androidboot.redroid_gpu_node=/dev/dri/renderD128 (UI crashing):
$ rg vulkan tmp/tmp.XFPkQr4em1/getprop.txt
360:[ro.hardware.vulkan]: [intel]
361:[ro.hwui.use_vulkan]: []
$ rg GLES tmp/tmp.XFPkQr4em1/dumpsys.txt
410:GLES: AMD, AMD Radeon RX 6950 XT (navi21, LLVM 16.0.0, DRM 3.52, 6.4.11-zen2-1-zen), OpenGL ES 3.2 Mesa 23.0.1 (git-b590fd1951)
I attach both log packages here for inspection (GitHub attachments won't accept tarballs, so I re-compressed them as zips):
Crashing renderD128: renderD128.zip
Working renderD129: renderD129.zip
Just tested with 2 GPUs in QEMU (virtio-gpu + passthrough dGPU), and redroid worked as expected (it can select different DRI drivers when different render nodes are provided).
Can you grab the logcat when renderD128 is provided? (You should grab it immediately after the container starts, otherwise the logcat might be overwritten.) I want to check the detailed logs from the allocator HIDL / surfaceflinger.
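For reference, a minimal way to capture it right away (a sketch; logcat -d dumps the current ring buffer and exits):
# dump logcat once, immediately after the container boots, before the buffer wraps
sudo podman exec redroid_test logcat -d > logcat-renderD128.txt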
Sorry, I've been occupied with other duties; I will get back to the requested testing this weekend.
Something weird happened on my system: after a reboot yesterday, the mapping of the GPUs got reversed again:
$ sudo bash -c 'for i in /sys/kernel/debug/dri/*/name; do echo -n "$i: "; /bin/cat $i; done;'
/sys/kernel/debug/dri/0/name: i915 dev=0000:00:02.0 master=pci:0000:00:02.0 unique=0000:00:02.0
/sys/kernel/debug/dri/128/name: amdgpu dev=0000:03:00.0 unique=0000:03:00.0
/sys/kernel/debug/dri/129/name: i915 dev=0000:00:02.0 master=pci:0000:00:02.0 unique=0000:00:02.0
/sys/kernel/debug/dri/1/name: amdgpu dev=0000:03:00.0 unique=0000:03:00.0
Theoretically, the order of these devices should depend on the order in which the kernel enumerated them at boot time. The only change I made to my system since last time is connecting another monitor. However, I cannot get the order back to the old one even if I remove the new monitor and reboot; it seems some random race condition at boot determined the enumeration order.
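For what it's worth, the boot-time probe order can be inspected from the kernel log (a quick sketch):
# whichever driver registers its DRM device first gets card0 / renderD128;
# the probe order is visible in the boot messages
sudo dmesg | grep -E -i 'i915|amdgpu' | head -n 20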
With the GPU order reversed, I can no longer reproduce the error. Now both /dev/dri/renderD128 and /dev/dri/renderD129 work, and they give:
androidboot.redroid_gpu_node=/dev/dri/renderD128:
$ adb shell dumpsys | grep GLES
GLES: Intel, Mesa Intel(R) UHD Graphics 630 (CFL GT2), OpenGL ES 3.2 Mesa 23.0.1 (git-b590fd1951)
$ adb shell getprop | grep vulkan
[ro.hardware.vulkan]: [radeon]
[ro.hwui.use_vulkan]: []
androidboot.redroid_gpu_node=/dev/dri/renderD129:
$ adb shell dumpsys | grep GLES
GLES: Intel, Mesa Intel(R) UHD Graphics 630 (CFL GT2), OpenGL ES 3.2 Mesa 23.0.1 (git-b590fd1951)
$ adb shell getprop | grep vulkan
[ro.hardware.vulkan]: [intel]
[ro.hwui.use_vulkan]: []
It seems that the GL layer selects the Intel device no matter what, and the Vulkan layer selects a device according to the config, but the outcome is that the UI always works.
It seems the bug is only triggered when the DRI devices are mapped in reverse order (iGPU to card1 and dGPU to card0). Let me try to blacklist i915 in my initramfs and see if that makes the kernel enumerate amdgpu first and reproduces the bug.
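A sketch of that blacklist, assuming a modprobe.d-style setup (the initramfs rebuild command depends on the distro):
# keep i915 from auto-loading early so amdgpu gets enumerated first
echo 'blacklist i915' | sudo tee /etc/modprobe.d/blacklist-i915.conf
# then rebuild the initramfs, e.g. `sudo mkinitcpio -P` (Arch) or
# `sudo update-initramfs -u` (Debian/Ubuntu), and reboot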
OK, some updates: it seems the bug is only triggered if the AMD GPU is mapped to /dev/dri/card0 and /dev/dri/renderD129, the Intel GPU is mapped to /dev/dri/card1 and /dev/dri/renderD128, and androidboot.redroid_gpu_node=/dev/dri/renderD128 is set.
I was not able to make my system map the GPUs in reverse order again, but such behavior can be simulated using volume bind mounts:
$ sudo podman run -itd --name redroid_test --privileged -p 5555:5555 \
-v /dev/dri/card0:/dev/dri/card1 -v /dev/dri/card1:/dev/dri/card0 \
-v /dev/dri/renderD129:/dev/dri/renderD128 \
-v /dev/dri/renderD128:/dev/dri/renderD129 \
redroid/redroid:11.0.0-latest \
androidboot.redroid_width=1920 androidboot.redroid_height=1080 \
androidboot.use_memfd=1 \
androidboot.redroid_gpu_mode=host \
androidboot.redroid_gpu_node=/dev/dri/renderD128
This reproduces the UI crashing bug; let me grab the requested logcat for it.
OK, the bug is just getting weirder at this point. I ran the configuration mentioned in the post above a few times, and it does not trigger every time: roughly 7 out of 10 runs triggered it, while the other 3 ran fine.
I had to run this command a few times and watch scrcpy to confirm the UI crashing:
sudo podman run -itd --name redroid_test --privileged -p 5555:5555 \
-v /dev/dri/card0:/dev/dri/card1 -v /dev/dri/card1:/dev/dri/card0 \
-v /dev/dri/renderD129:/dev/dri/renderD128 \
-v /dev/dri/renderD128:/dev/dri/renderD129 \
redroid/redroid:11.0.0-latest androidboot.redroid_width=1920 androidboot.redroid_height=1080 \
androidboot.use_memfd=1 androidboot.redroid_gpu_mode=host \
androidboot.redroid_gpu_node=/dev/dri/renderD128; \
(until adb devices | grep '^127\.0\.0\.1:5555\s*device$'; \
do adb connect 127.0.0.1:5555; done; scrcpy -s 127.0.0.1:5555 --no-audio) & \
until sudo podman exec redroid_test logcat >> logcat.txt; do echo retry; done;
@zhouziyang Here is the requested logcat:
Only the DRM render node (/dev/dri/renderDxxx) is used in redroid, and for simplicity redroid detects GPU settings via debugfs (/sys/kernel/debug/dri/); so it won't work if you try to bind mount /dev/dri/renderD129 to /dev/dri/renderD128.
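To illustrate why the bind mount can't fool that detection (a sketch of the mechanism, not redroid's actual code): the debugfs entry is keyed by the node's real minor number, which a bind mount does not change:
node=/dev/dri/renderD128
# stat %T prints the character device's minor number in hex (128 -> 80)
minor=$(( 16#$(stat -c '%T' "$node") ))
sudo cat "/sys/kernel/debug/dri/$minor/name"
# bind-mounting renderD129 over this path renames it only; the device keeps
# its real minor, so the debugfs lookup still resolves the original GPU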
@chaserhkj Did you ever solve this? I ran into a similar problem on a dual-amdgpu setup: redroid GLES always picks the renderD129 card; if the node is set to D128 the launcher crashes, while setting it to 129 works fine. (translated from Chinese)
Similar issue here.
I am using an R5 5600G with its built-in iGPU + a Radeon RX 6650 XT dGPU.
In my case, when /dev/dri/card0 and /dev/dri/renderD129 refer to the iGPU and /dev/dri/card1 and /dev/dri/renderD128 refer to the dGPU, I can only use the iGPU to run Redroid without any issue. I tested versions 11.0 to 14.0 with the following results:
- 11.0: Crashed after boot.
- 12.0: Crashed after boot.
- 13.0: Boots OK, but crashes when the Widgets selector comes up (press and hold on the default launcher, then tap Widgets), and sometimes crashes when switching apps. If the /data directory is persisted, the container never returns to a normal state after a crash.
- 14.0: Same as 13.0.
@chaserhkj Thanks for the information! I was testing whether I could use my dGPU to run Redroid; my goal is to run Redroid under a hypervisor with GPU passthrough, so it seems my goal is actually a possible workaround, interesting. lol
By the way, I currently map /dev/dri/card1 to /dev/dri/card0 in my container setup and Redroid works fine for now. I can't say this is a stable solution in the long run, but maybe you can try it out @wx5391805 .
I am currently using docker-compose to manage my container configuration; here is my docker-compose.yml for anyone who wants to refer to it:
services:
  redroid14:
    stdin_open: true
    tty: true
    privileged: true
    volumes:
      # map `/dev/dri/card1` to `/dev/dri/card0`
      - type: bind
        source: "/dev/dri/card1"
        target: "/dev/dri/card0"
      # persisted /data
      - ./redroid14_data:/data
    ports:
      - 5555:5555
    image: redroid/redroid:14.0.0_64only-latest
    command: >
      androidboot.use_memfd=1
      androidboot.redroid_width=720
      androidboot.redroid_height=1280
      androidboot.redroid_fps=60
      androidboot.redroid_gpu_mode=host
      androidboot.redroid_gpu_node=/dev/dri/renderD128
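If anyone prefers plain podman over compose, the equivalent one-off invocation for the same card remap would presumably be (an untested sketch mirroring the compose file above):
sudo podman run -itd --name redroid14 --privileged -p 5555:5555 \
-v /dev/dri/card1:/dev/dri/card0 \
-v "$PWD/redroid14_data:/data" \
redroid/redroid:14.0.0_64only-latest \
androidboot.use_memfd=1 androidboot.redroid_width=720 androidboot.redroid_height=1280 \
androidboot.redroid_fps=60 androidboot.redroid_gpu_mode=host \
androidboot.redroid_gpu_node=/dev/dri/renderD128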
Having the exact same issue. Does anyone know of a solution?
I tried the volume binding trick, but no luck.