
UI crashing with inconsistent dri device mapping on redroid:11.0.0-latest

Open chaserhkj opened this issue 2 years ago • 14 comments

When running redroid:11.0.0-latest on my new RX 6950 XT, the UI seems to crash constantly, showing only a flickering wallpaper in scrcpy.

I use podman with root to run redroid:

sudo podman run -itd --name redroid_test --privileged -p 5555:5555 redroid/redroid:11.0.0-latest \
androidboot.redroid_width=1920 androidboot.redroid_height=1080 \
androidboot.use_memfd=1 \
androidboot.redroid_gpu_mode=host \
androidboot.redroid_gpu_node=/dev/dri/renderD128

My /dev/dri/renderD128 is mapped to the RX 6950 XT, and /dev/dri/renderD129 is mapped to the iGPU (Intel 9900K). Once I set androidboot.redroid_gpu_node=/dev/dri/renderD129, everything runs fine.

EDIT: this assumption is wrong, see the comment below.

Logs collected through curl -fsSL https://raw.githubusercontent.com/remote-android/redroid-doc/master/debug.sh | sed s/docker/podman/g |sudo bash -s -- redroid_test

image-inspect.txt network.txt dumpsys.txt container-inspect.txt getprop.txt vainfo.txt ps.txt podman-info.txt logcat.txt dri.txt dmesg.txt crash.txt uname.txt lscpu.txt getenforce.txt drivers.txt config-6.4.11-zen2-1-zen.txt

This looks like a Mesa issue. Maybe I can try rebuilding the redroid image when I have time, to see if that helps.

chaserhkj avatar Aug 21 '23 22:08 chaserhkj

BTW, is it possible to just bind mount my system mesa into the container? What paths should I look for?

chaserhkj avatar Aug 21 '23 22:08 chaserhkj

Something weird:

  • from dumpsys.txt, GLES: AMD, AMD Radeon RX 6950 XT (navi21, LLVM 16.0.0, DRM 3.52, 6.4.11-zen2-1-zen), OpenGL ES 3.2 Mesa 23.0.1 (git-b590fd1951)
  • from getprop.txt, [ro.hardware.vulkan]: [intel]

Were all these logs collected from the same container? (BTW, please provide the compressed tarball.)

You cannot simply bind mount the Mesa libs from the host (they are ABI-incompatible). It is possible, though, to build the mesa3d / LLVM libs yourself and bind mount them into the redroid container.

zhouziyang avatar Aug 22 '23 09:08 zhouziyang

The logs are indeed from the same container, and after double-checking I found something weird on my system setup that might be relevant to this inconsistency.

The assumption about the DRI device mapping in my initial post was wrong. On my system, /dev/dri/card0 and /dev/dri/renderD129 are mapped to the AMD dGPU, while /dev/dri/card1 and /dev/dri/renderD128 are mapped to the Intel iGPU:

~ sudo /bin/ls -l /dev/dri
total 0
drwxr-xr-x  2 root root        120 Aug 22 10:52 by-path
crw-rw----+ 1 root video  226,   0 Aug 22 19:34 card0
crw-rw----+ 1 root video  226,   1 Aug 22 10:52 card1
crw-rw-rw-  1 root render 226, 128 Aug 22 10:32 renderD128
crw-rw-rw-  1 root render 226, 129 Aug 22 10:52 renderD129
~ sudo bash -c 'for i in /sys/kernel/debug/dri/*/name; do echo -n "$i: "; /bin/cat $i; done;'
/sys/kernel/debug/dri/0/name: amdgpu dev=0000:03:00.0 unique=0000:03:00.0
/sys/kernel/debug/dri/128/name: i915 dev=0000:00:02.0 master=pci:0000:00:02.0 unique=0000:00:02.0
/sys/kernel/debug/dri/129/name: amdgpu dev=0000:03:00.0 unique=0000:03:00.0
/sys/kernel/debug/dri/1/name: i915 dev=0000:00:02.0 master=pci:0000:00:02.0 unique=0000:00:02.0

The order of the DRI devices is inverted between the card* and renderD* nodes. My theory is that if any part of redroid assumes the two are in the same order, an inconsistency could result (maybe [ro.hardware.vulkan] reads from /dev/dri/renderD128 and gets Intel, while Mesa GLES reads from /dev/dri/card0 and gets the AMD GPU).

How redroid still works with androidboot.redroid_gpu_node=/dev/dri/renderD129 (which, on second check, actually selects the AMD GPU) under this mess is still a mystery to me.
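For anyone who wants to check the same mapping without mounting debugfs, here is a minimal sketch that reads the driver bound to each card* and renderD* node from sysfs. The function name drm_node_drivers and its optional root argument are made up for illustration; the /sys/class/drm/<node>/device/driver symlink is the standard sysfs layout on Linux.

```shell
# Print "<node> <driver>" for every DRM card/render node found in sysfs.
# $1 (optional, hypothetical): alternative sysfs root, defaults to /sys/class/drm.
# Usage: drm_node_drivers
drm_node_drivers() {
    local root="${1:-/sys/class/drm}" node drv
    for node in "$root"/card[0-9]* "$root"/renderD[0-9]*; do
        # Skip connector entries such as card0-DP-1 (and unmatched glob patterns).
        case "${node##*/}" in *-*) continue ;; esac
        # The driver symlink points at e.g. /sys/bus/pci/drivers/amdgpu.
        [ -e "$node/device/driver" ] || continue
        drv=$(basename "$(readlink -f "$node/device/driver")")
        printf '%s %s\n' "${node##*/}" "$drv"
    done
}
```

Comparing the driver printed for card0/card1 with the one printed for renderD128/renderD129 shows whether the two orderings disagree, as they did in the listing above.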

chaserhkj avatar Aug 23 '23 00:08 chaserhkj

Edited to reflect new findings

chaserhkj avatar Aug 23 '23 00:08 chaserhkj

It seems that under my setup Mesa GLES always selects the AMD GPU, while Vulkan selects according to androidboot.redroid_gpu_node.

With androidboot.redroid_gpu_node=/dev/dri/renderD129 (working config):

$ rg vulkan tmp/tmp.En9k2XamRr/getprop.txt
360:[ro.hardware.vulkan]: [radeon]
361:[ro.hwui.use_vulkan]: []
$ rg GLES tmp/tmp.En9k2XamRr/dumpsys.txt
582:GLES: AMD, AMD Radeon RX 6950 XT (navi21, LLVM 16.0.0, DRM 3.52, 6.4.11-zen2-1-zen), OpenGL ES 3.2 Mesa 23.0.1 (git-b590fd1951)

With androidboot.redroid_gpu_node=/dev/dri/renderD128 (UI crashing):

$ rg vulkan tmp/tmp.XFPkQr4em1/getprop.txt
360:[ro.hardware.vulkan]: [intel]
361:[ro.hwui.use_vulkan]: []
$ rg GLES tmp/tmp.XFPkQr4em1/dumpsys.txt
410:GLES: AMD, AMD Radeon RX 6950 XT (navi21, LLVM 16.0.0, DRM 3.52, 6.4.11-zen2-1-zen), OpenGL ES 3.2 Mesa 23.0.1 (git-b590fd1951)

I attach both log packages here for inspection (GitHub attachments won't accept tarballs, so I re-compressed them as zip):

Crashing renderD128: renderD128.zip

Working renderD129: renderD129.zip

chaserhkj avatar Aug 23 '23 00:08 chaserhkj

Just tested with 2 GPUs in QEMU (virtio-gpu + passthrough dGPU), and redroid worked as expected (it can select different DRI drivers when different render nodes are provided).

Can you grab the logcat when renderD128 is provided? It should be grabbed immediately after the container starts; otherwise the logcat might be overwritten. I want to check the detailed logs from allocator HIDL / surfaceflinger.

zhouziyang avatar Aug 24 '23 02:08 zhouziyang

Sorry, I got occupied by other duties. I will get back to the requested testing this weekend.

chaserhkj avatar Aug 24 '23 23:08 chaserhkj

Something weird happened on my system: after a reboot yesterday, I got the mapping of GPUs reversed again:

$ sudo bash -c 'for i in /sys/kernel/debug/dri/*/name; do echo -n "$i: "; /bin/cat $i; done;'
/sys/kernel/debug/dri/0/name: i915 dev=0000:00:02.0 master=pci:0000:00:02.0 unique=0000:00:02.0
/sys/kernel/debug/dri/128/name: amdgpu dev=0000:03:00.0 unique=0000:03:00.0
/sys/kernel/debug/dri/129/name: i915 dev=0000:00:02.0 master=pci:0000:00:02.0 unique=0000:00:02.0
/sys/kernel/debug/dri/1/name: amdgpu dev=0000:03:00.0 unique=0000:03:00.0

Theoretically, the order of these devices should depend on the order in which the kernel enumerated them at boot time. The only change I have made to my system since last time, though, is connecting another monitor. However, I cannot get the old order back even if I remove the new monitor and reboot the system. It seems like some random race condition at boot determines the enumeration order.

With the GPU order reversed, I can no longer reproduce the error. Now both /dev/dri/renderD128 and /dev/dri/renderD129 work, and they give:
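Since the enumeration order can change between boots, one way to avoid hard-coding renderD128 or renderD129 is to resolve the render node from the GPU's PCI address through the by-path directory that shows up in the /dev/dri listing earlier. This is only a sketch: render_node_for_pci is a hypothetical helper, it relies on udev's usual pci-<address>-render link naming, and whether it avoids the crash has not been tested in this thread.

```shell
# Resolve the render node for a PCI address via udev's by-path symlinks.
# $1: PCI address, e.g. 0000:03:00.0 (the amdgpu device in the output above).
# $2 (optional, hypothetical): by-path directory, defaults to /dev/dri/by-path.
render_node_for_pci() {
    local pci="$1" dir="${2:-/dev/dri/by-path}"
    local link="$dir/pci-$pci-render"
    # Fail loudly if udev did not create a render link for this device.
    [ -e "$link" ] || { echo "no render node for $pci" >&2; return 1; }
    # Follow the symlink to the real node, e.g. /dev/dri/renderD129.
    readlink -f "$link"
}
```

The result could then be passed at container start as androidboot.redroid_gpu_node="$(render_node_for_pci 0000:03:00.0)", so the same GPU is requested regardless of which number it received at boot.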

androidboot.redroid_gpu_node=/dev/dri/renderD128:

$ adb shell dumpsys  | grep GLES
GLES: Intel, Mesa Intel(R) UHD Graphics 630 (CFL GT2), OpenGL ES 3.2 Mesa 23.0.1 (git-b590fd1951)
$ adb shell getprop  | grep vulkan
[ro.hardware.vulkan]: [radeon]
[ro.hwui.use_vulkan]: []

androidboot.redroid_gpu_node=/dev/dri/renderD129:

~ adb shell dumpsys  | grep GLES
GLES: Intel, Mesa Intel(R) UHD Graphics 630 (CFL GT2), OpenGL ES 3.2 Mesa 23.0.1 (git-b590fd1951)
~ adb shell getprop  | grep vulkan
[ro.hardware.vulkan]: [intel]
[ro.hwui.use_vulkan]: []

It seems that the GL layer selected the Intel device no matter what, while the Vulkan layer selected the device according to the config. Either way, the UI always works.

It seems the bug is only triggered when the DRI devices are mapped in the reverse order (iGPU to card1 and dGPU to card0). Let me try to blacklist i915 in my initramfs and see if that makes the kernel enumerate amdgpu first and reproduces the bug.
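To make this comparison quicker across log packages, here is a rough sketch that pulls the two values out of the getprop.txt and dumpsys.txt files collected by debug.sh. The function name gpu_selection_summary is made up, and it assumes the line formats shown in the outputs above.

```shell
# Summarize which GPU the Vulkan HAL and the GLES driver ended up on.
# $1: getprop.txt, $2: dumpsys.txt (file names as produced by debug.sh).
gpu_selection_summary() {
    local vulkan gles
    # e.g. "[ro.hardware.vulkan]: [intel]" -> intel
    vulkan=$(sed -n 's/.*\[ro\.hardware\.vulkan\]: \[\(.*\)\].*/\1/p' "$1" | head -n 1)
    # e.g. "GLES: AMD, AMD Radeon RX 6950 XT ..." -> AMD
    gles=$(sed -n 's/.*GLES: \([^,]*\),.*/\1/p' "$2" | head -n 1)
    printf 'vulkan=%s gles=%s\n' "$vulkan" "$gles"
}
```

A mismatch between the two values is not a crash indicator by itself: in the outputs above, renderD128 gives radeon for Vulkan and Intel for GLES and the UI still works.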

chaserhkj avatar Aug 27 '23 09:08 chaserhkj

OK, some updates: it seems that the bug is only triggered if the AMD GPU is mapped to /dev/dri/card0 and /dev/dri/renderD129, the Intel GPU is mapped to /dev/dri/card1 and /dev/dri/renderD128, and androidboot.redroid_gpu_node=/dev/dri/renderD128 is set.

I was not able to make my system map the GPUs in reverse order again, but it seems that such behavior can be simulated using volume bind mounts:

$ sudo podman run -itd --name redroid_test --privileged -p 5555:5555  \
-v /dev/dri/card0:/dev/dri/card1 -v /dev/dri/card1:/dev/dri/card0 \
-v /dev/dri/renderD129:/dev/dri/renderD128 \
-v /dev/dri/renderD128:/dev/dri/renderD129 \
redroid/redroid:11.0.0-latest \
androidboot.redroid_width=1920 androidboot.redroid_height=1080 \
androidboot.use_memfd=1 \
androidboot.redroid_gpu_mode=host \
androidboot.redroid_gpu_node=/dev/dri/renderD128

This reproduces the UI crashing bug. Let me grab the requested logcat with it.

chaserhkj avatar Aug 28 '23 07:08 chaserhkj

OK, the bug is just getting weirder at this point. I ran the configuration from the post above a few times, and the bug does not trigger every time: roughly 7 out of 10 runs crashed, while the other 3 ran fine.

I had to run this command a few times and watch scrcpy to confirm that the UI was crashing:

sudo podman run -itd --name redroid_test --privileged -p 5555:5555 \
-v /dev/dri/card0:/dev/dri/card1 -v /dev/dri/card1:/dev/dri/card0 \
-v /dev/dri/renderD129:/dev/dri/renderD128 \
-v /dev/dri/renderD128:/dev/dri/renderD129 \
redroid/redroid:11.0.0-latest androidboot.redroid_width=1920 androidboot.redroid_height=1080 \
androidboot.use_memfd=1 androidboot.redroid_gpu_mode=host \
androidboot.redroid_gpu_node=/dev/dri/renderD128; \
(until adb devices | grep '^127\.0\.0\.1:5555\s*device$'; \
do adb connect 127.0.0.1:5555; done; scrcpy -s 127.0.0.1:5555 --no-audio) & \
until sudo podman exec redroid_test logcat >> logcat.txt; do echo retry; done;

@zhouziyang logcat requested:

logcat.txt

chaserhkj avatar Aug 28 '23 08:08 chaserhkj

Only the DRM render node (/dev/dri/renderDxxx) is used in redroid. Also, for simplicity, redroid detects GPU settings via debugfs (/sys/kernel/debug/dri/), so it won't work if you bind mount /dev/dri/renderD129:/dev/dri/renderD128.

zhouziyang avatar Aug 30 '23 11:08 zhouziyang

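@chaserhkj Has this been solved? I ran into a similar problem with dual amdgpu cards: redroid GLES always picks the card behind renderD129. If the node is set to renderD128 the launcher crashes; if it is set to renderD129 everything works.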

wx5391805 avatar Apr 12 '24 06:04 wx5391805

Similar issue here.

I am using an R5 5600G built-in iGPU + a Radeon RX 6650 XT dGPU.

In my case, when /dev/dri/card0 and /dev/dri/renderD129 refer to the iGPU and /dev/dri/card1 and /dev/dri/renderD128 refer to the dGPU, only the iGPU runs Redroid without any issue. I tested versions 11.0 to 14.0 with the following results:

  • 11.0 Crashed after boot.
  • 12.0 Crashed after boot.
  • 13.0 Boots OK, but crashes when the Widgets selector comes up (press and hold on the default launcher, then tap Widgets), and sometimes crashes when switching apps. If the /data directory is persisted, the container never returns to a normal state after a crash.
  • 14.0 Same as 13.0.

@chaserhkj Thanks for your information! I am testing whether I can use my dGPU to run Redroid, and my goal is to run Redroid under a hypervisor with GPU passthrough. It seems my goal is actually a possible solution, interesting. lol

By the way, I currently map /dev/dri/card1 to /dev/dri/card0 in the container setup, and Redroid works fine for now. I can't say this is a stable solution in the long run, but maybe you can try it out @wx5391805 .

I am currently using docker-compose to manage my container configuration, and here is my docker-compose.yml for anyone who may want to refer to it:

services:
  redroid14:
    stdin_open: true
    tty: true
    privileged: true
    volumes:
      # map `/dev/dri/card1` to `/dev/dri/card0`
      - type: bind
        source: "/dev/dri/card1"
        target: "/dev/dri/card0"
      # persisted /data
      - ./redroid14_data:/data
    ports:
      - 5555:5555
    image: redroid/redroid:14.0.0_64only-latest
    command: >
        androidboot.use_memfd=1
        androidboot.redroid_width=720
        androidboot.redroid_height=1280
        androidboot.redroid_fps=60
        androidboot.redroid_gpu_mode=host
        androidboot.redroid_gpu_node=/dev/dri/renderD128

stormyyd avatar Jun 29 '24 14:06 stormyyd

Having the exact same issue. Does anyone know of a solution?

I tried the volume binding trick but with no luck.

redroid-debug.2r8PHs0c (no crash, renderD129).zip

redroid-debug.tsB3FwPE (crash, renderD128).zip

2meito avatar Aug 16 '25 01:08 2meito