Compile errors on CentOS aarch64

Open FinnStokes opened this issue 6 years ago • 9 comments

When installing the rock-dkms package on CentOS 7 aarch64 with kernel 4.14.0-49.el7a.aarch64, the compilation ran into the following problems:

In file included from <command-line>:0:0:
././include/linux/kconfig.h:67:1: fatal error: /usr/src/kernels/4.14.0-49.el7a.aarch64/include/drm/drm_backport.h: No such file or directory
 #endif /* __LINUX_KCONFIG_H */
 ^
In file included from <command-line>:0:0:
././include/linux/kconfig.h:67:1: fatal error: /usr/src/kernels/4.14.0-49.el7a.aarch64/include/drm/drm_backport.h: No such file or directory
 #endif /* __LINUX_KCONFIG_H */
 ^
...

I fixed this by manually overriding the OS_NAME variable to OS_NAME="custom-rhel" in the makefile. I'm pretty sure this is the same as issue #56, so I'm only including it here for completeness, in case it affects one of the other errors.

/var/lib/dkms/amdgpu/2.1-96.el7/build/amd/amdgpu/../display/dc/gpio/hw_factory.c: In function ‘dal_hw_factory_init’:
/var/lib/dkms/amdgpu/2.1-96.el7/build/amd/amdgpu/../display/dc/gpio/hw_factory.c:92:3: error: implicit declaration of function ‘dal_hw_factory_dcn10_init’ [-Werror=implicit-function-declaration]
   dal_hw_factory_dcn10_init(factory);
   ^
cc1: some warnings being treated as errors
make[2]: *** [/var/lib/dkms/amdgpu/2.1-96.el7/build/amd/amdgpu/../display/dc/gpio/hw_factory.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/var/lib/dkms/amdgpu/2.1-96.el7/build/amd/amdgpu/../display/dc/gpio/hw_translate.c: In function ‘dal_hw_translate_init’:
/var/lib/dkms/amdgpu/2.1-96.el7/build/amd/amdgpu/../display/dc/gpio/hw_translate.c:89:3: error: implicit declaration of function ‘dal_hw_translate_dcn10_init’ [-Werror=implicit-function-declaration]
   dal_hw_translate_dcn10_init(translate);
   ^
cc1: some warnings being treated as errors

I fixed this by disabling CONFIG_DRM_AMD_DC_DCN1_01 in the makefile. I think the actual problem is simply a missing include statement in display/dc/gpio/hw_factory.c due to an #ifdef CONFIG_X86, but since I'm not clear why DCN_VERSION_1_0 is disabled in aarch64, and I don't need DCN_VERSION_1_01, I simply disabled it.
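
A minimal sketch of the suspected pattern, not the driver's actual code: if the declaration of an init function is only visible under one config symbol while the call site is compiled under another, the compiler sees an undeclared function on builds where only the second symbol is set. The `DEMO_*` macros and `demo_*` functions below are hypothetical stand-ins for `CONFIG_X86`, `CONFIG_DRM_AMD_DC_DCN1_01` and the DCN10 init routines.

```c
/* Hypothetical stand-in for CONFIG_DRM_AMD_DC_DCN1_01 (enabled). */
#define DEMO_CONFIG_DCN1_01 1
/* DEMO_CONFIG_X86 is deliberately left undefined to mimic an aarch64 build. */

#ifdef DEMO_CONFIG_X86
/* Stand-in for the header that declares the DCN 1.0 init function. */
static int demo_dcn10_init(void)
{
    return 10;
}
#endif

/* Returns a positive tag when an init routine was available, or -1 otherwise. */
static int demo_factory_init(void)
{
#if defined(DEMO_CONFIG_DCN1_01) && defined(DEMO_CONFIG_X86)
    /* The call is guarded by the same condition that exposes the declaration. */
    return demo_dcn10_init();
#else
    /*
     * Guarding the call with DEMO_CONFIG_DCN1_01 alone would reference
     * demo_dcn10_init() here without a declaration, which is the kind of
     * "implicit declaration" error shown in the build log.
     */
    return -1;
#endif
}
```

With `DEMO_CONFIG_X86` undefined, `demo_factory_init()` compiles and returns -1. Whether the real fix should widen the include guard or narrow the call-site guard depends on whether DCN 1.0 is meant to be supported on aarch64, which the thread does not settle.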

/var/lib/dkms/amdgpu/2.1-96.el7/build/ttm/ttm_bo_util.c: In function ‘amdttm_kmap_atomic_prot’:
/var/lib/dkms/amdgpu/2.1-96.el7/build/ttm/ttm_bo_util.c:299:3: error: implicit declaration of function ‘__kcl__kmap_atomic’ [-Werror=implicit-function-declaration]
   return __kcl__kmap_atomic(page);
   ^
/var/lib/dkms/amdgpu/2.1-96.el7/build/ttm/ttm_bo_util.c:299:3: warning: return makes pointer from integer without a cast [enabled by default]
/var/lib/dkms/amdgpu/2.1-96.el7/build/ttm/ttm_bo_util.c: In function ‘amdttm_kunmap_atomic_prot’:
/var/lib/dkms/amdgpu/2.1-96.el7/build/ttm/ttm_bo_util.c:315:3: error: implicit declaration of function ‘__kcl__kunmap_atomic’ [-Werror=implicit-function-declaration]
   __kcl__kunmap_atomic(addr);
   ^

I fixed this by adding the definitions

 #ifdef CONFIG_X86
 #ifdef OS_NAME_RHEL_6
 #define __kcl__kmap_atomic(__page)      kmap_atomic(__page, KM_USER0)
 #define __kcl__kunmap_atomic(__addr)    kunmap_atomic(__addr, KM_USER0)
 #define __ttm_kmap_atomic_prot(__page, __prot)  kmap_atomic_prot(__page, KM_USER0, __prot)
 #define __ttm_kunmap_atomic(__addr)             kunmap_atomic(__addr, KM_USER0)
 #else
 #define __kcl__kmap_atomic(__page)      kmap_atomic(__page)
 #define __kcl__kunmap_atomic(__addr)    kunmap_atomic(__addr)
 #define __ttm_kmap_atomic_prot(__page, __prot) kmap_atomic_prot(__page, __prot)
 #define __ttm_kunmap_atomic(__addr) kunmap_atomic(__addr)
 #endif
 #else
+#define __kcl__kmap_atomic(__page)      kmap_atomic(__page)
+#define __kcl__kunmap_atomic(__addr)    kunmap_atomic(__addr)
 #define __ttm_kmap_atomic_prot(__page, __prot) vmap(&__page, 1, 0,  __prot)
 #define __ttm_kunmap_atomic(__addr) vunmap(__addr)
 #endif

to ttm/ttm_bo_util.c. I'm not sure this is the right approach, but it at least keeps the behaviour on aarch64 the same as before 7d0741bab20cb328c2c778764efc340896242ccb.

It now builds, but the machine won't boot from a kernel image including the compiled modules. I'm now trying to see if I can figure out why.

FinnStokes, Feb 11 '19 16:02

I rolled back to 2.0-89, employing similar fixes to get it to compile. Despite the fact that dkms install amdgpu/2.0-89.el7 gives the error dracut:Failed to install module amdkfd, it seems to be booting fine and /dev/kfd is present so the driver is at least partially working. I will install the rest of ROCm and do some further testing to check that it is actually working.

FinnStokes, Feb 11 '19 18:02

FinnStokes, can you try the 2.2 release and see how it goes? We had a patch internally to address this, but it may not have made it to the 2.2 release. We also merged amdgpu and amdkfd, so that dracut error won't affect things, thankfully. If the error persists (without your fix), then send a Pull Request that we can integrate into our 2.3 build and our base DKMS branch, so that it can be included in all future releases as well. Thanks!

kentrussell, Mar 12 '19 12:03

The DKMS branch doesn't support AArch64. We and ARM are working on enabling KFD on AArch64 in the upstream kernel. Your best bet is trying an upstream kernel with the user mode from the ROCm release. Last I checked, amdgpu.sched_policy=2 was needed to get the driver to boot successfully on AArch64.

fxkamd, Mar 12 '19 19:03

@kentrussell, I have updated to 2.2, and the same issues persist. I will create a pull request that fixes all my compilation problems aside from the drm_backport errors that I think are covered by #56. However, I still get a kernel panic on boot. I tried the suggestion from @fxkamd of adding amdgpu.sched_policy=2 to the boot options, but it didn't seem to change anything:

[    6.073918] pcieport 0000:00:00.0: AER: Corrected error received: id=0700
[    6.076534] [drm] Display Core initialized with v3.2.14!
[    6.077606] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    6.077608] [drm] Driver supports precise vblank timestamp query.
[    6.098960] [drm] UVD and UVD ENC initialized successfully.
[    6.106963] pcieport 0000:07:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0700(Transmitter ID)
[    6.117791] pcieport 0000:07:00.0:   device [15b3:1975] error status/mask=00001000/00002000
[    6.126788] pcieport 0000:07:00.0:    [12] Replay Timer Timeout  
[    6.134181] pcieport 0000:00:00.0: AER: Corrected error received: id=0700
[    6.199390] [drm] VCE initialized successfully.
[    6.204476] kfd kfd: Allocated 3969056 bytes on gart
[    6.209492] Virtual CRAT table created for GPU
[    6.213926] Parsing CRAT table with 1 nodes
[    6.218133] Creating topology SYSFS entries
[    6.222680] Topology: Add dGPU node [0x687f:0x1002]
[    6.233357] kfd kfd: added device 1002:687f
[    6.237945] [drm] Cannot find any crtc or sizes
[    6.242723] amdgpu 0000:0a:00.0: ring gfx uses VM inv eng 0 on hub 0
[    6.249088] amdgpu 0000:0a:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    6.256040] amdgpu 0000:0a:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    6.263004] amdgpu 0000:0a:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    6.269966] amdgpu 0000:0a:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    6.270463] irq 22: nobody cared (try booting with the "irqpoll" option)
[    6.270472] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OE  ------------   4.14.0-49.el7a.aarch64 #1
[    6.270476] Call trace:
[    6.270498] [<ffff000008088e6c>] dump_backtrace+0x0/0x23c
[    6.270504] [<ffff0000080890cc>] show_stack+0x24/0x2c
[    6.270512] [<ffff0000087f607c>] dump_stack+0x84/0xa8
[    6.270519] [<ffff00000813cb00>] __report_bad_irq+0x40/0xec
[    6.270522] [<ffff00000813ce6c>] note_interrupt+0x1f4/0x2b8
[    6.270529] [<ffff000008139b9c>] handle_irq_event_percpu+0x60/0x88
[    6.270533] [<ffff000008139c14>] handle_irq_event+0x50/0x80
[    6.270536] [<ffff00000813db80>] handle_fasteoi_irq+0x9c/0x144
[    6.270540] [<ffff0000081389cc>] generic_handle_irq+0x34/0x4c
[    6.270544] [<ffff0000081390b8>] __handle_domain_irq+0x6c/0xc4
[    6.270547] [<ffff0000080816c8>] gic_handle_irq+0xa0/0x1b8
[    6.270551] Exception stack(0xffff000008cefd80 to 0xffff000008cefec0)
[    6.270554] fd80: 0000800ff2b50000 ffff000008cb0010 ffff000008d20c34 0000000000000001
[    6.270558] fda0: 0000000000000000 ffff000009351d00 00000000000004de 00000000ffff8d43
[    6.270561] fdc0: ffff000008d262e0 ffff000008cefe30 0000000000000d00 0000000014cf10db
[    6.270564] fde0: 0000000000000018 0000000000010000 001443fcf56b90a0 00008b13e65db528
[    6.270568] fe00: 0000000000000000 0000000000000000 0000000000000012 ffff000008e769f8
[    6.270571] fe20: ffff000008cb0000 ffff000008e76000 0000000000000000 ffff000008d1c504
[    6.270574] fe40: 0000000000000000 0000000000000000 000000107a94dae0 000000107a89b280
[    6.270577] fe60: 0000000080b90018 ffff000008cefec0 ffff000008085480 ffff000008cefec0
[    6.270580] fe80: ffff000008085484 0000000060000005 ffff000008cefea0 ffff000008149210
[    6.270583] fea0: ffffffffffffffff ffff000008149278 ffff000008cefec0 ffff000008085484
[    6.270586] [<ffff000008082fb0>] el1_irq+0xb0/0x140
[    6.270590] [<ffff000008085484>] arch_cpu_idle+0x44/0x144
[    6.270596] [<ffff000008810818>] default_idle_call+0x20/0x30
[    6.270601] [<ffff00000811decc>] do_idle+0x158/0x1cc
[    6.270604] [<ffff00000811e0dc>] cpu_startup_entry+0x28/0x30
[    6.270608] [<ffff00000880a2ac>] rest_init+0xbc/0xc8
[    6.270615] [<ffff000008b90dac>] start_kernel+0x410/0x43c
[    6.270618] handlers:
[    6.270624] [<ffff00000845aa74>] aer_irq
[    6.270628] Disabling IRQ #22
[    6.487676] amdgpu 0000:0a:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    6.487679] amdgpu 0000:0a:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    6.487689] amdgpu 0000:0a:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    6.487693] amdgpu 0000:0a:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    6.487696] amdgpu 0000:0a:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    6.487707] Unable to handle kernel paging request at virtual address ffff000013e8f774
[    6.487709] Mem abort info:
[    6.487711]   Exception class = DABT (current EL), IL = 32 bits
[    6.487713]   SET = 0, FnV = 0
[    6.487715]   EA = 0, S1PTW = 0
[    6.487716] Data abort info:
[    6.487718]   ISV = 0, ISS = 0x00000021
[    6.487720]   CM = 0, WnR = 0
[    6.487723] swapper pgtable: 64k pages, 48-bit VAs, pgd = ffff000009660000
[    6.487725] [ffff000013e8f774] *pgd=000000107ffb0003, *pud=000000107ffb0003, *pmd=000000107ffa0003, *pte=00e8001053bf0f13
[    6.487737] Internal error: Oops: 96000021 [#1] SMP
[    6.487742] Modules linked in: amdgpu(OE+) mmc_block amdchash(OE) i2c_algo_bit amd_sched(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops amdttm(OE) amdkcl(OE) drm i2c_core dw_mmc_pltfm(OE) dw_mmc(OE) tmfifo(OE) mmc_core virtio virtio_ring
[    6.487781] CPU: 11 PID: 275 Comm: systemd-udevd Tainted: G           OE  ------------   4.14.0-49.el7a.aarch64 #1
[    6.487784] task: ffff800fd33caa00 task.stack: ffff000013e80000
[    6.487800] PC is at change_bit+0x18/0x2c
[    6.488435] LR is at gmc_v9_0_late_init+0x94/0x420 [amdgpu]
[    6.488438] pc : [<ffff0000087f3848>] lr : [<ffff00000132145c>] pstate: 20000005
[    6.488439] sp : ffff000013e8f700
[    6.488441] x29: ffff000013e8f700 x28: ffff800fd3883050 
[    6.488446] x27: 0000000000000000 x26: 0000000000000001 
[    6.488451] x25: ffff800fd3880000 x24: ffff000008d13c08 
[    6.488454] x23: ffff0000014ea4d0 x22: 000000000000000a 
[    6.488458] x21: ffff800fd3883430 x20: ffff800fd3886e50 
[    6.488462] x19: ffff800fd3880000 x18: ffff0000014976f8 
[    6.488466] x17: 0000000000000000 x16: 0000000000000000 
[    6.488470] x15: 0000000000000000 x14: 3020627568206e6f 
[    6.488473] x13: 20303120676e6520 x12: 35b1aca30dc27500 
[    6.488477] x11: ffff0000093e4b30 x10: 0000000000000000 
[    6.488481] x9 : 00000000000f35ac x8 : ffff0000093f1b57 
[    6.488485] x7 : 0000000000000000 x6 : 000000003c6bd6e4 
[    6.488489] x5 : ffff800ffba16358 x4 : 0000000000000000 
[    6.488492] x3 : 0000000000000001 x2 : 0000000000000001 
[    6.488496] x1 : ffff000013e8f774 x0 : 0000000000000000 
[    6.488501] Process systemd-udevd (pid: 275, stack limit = 0xffff000013e80000)
[    6.488503] Call trace:
[    6.488507] Exception stack(0xffff000013e8f5c0 to 0xffff000013e8f700)
[    6.488511] f5c0: 0000000000000000 ffff000013e8f774 0000000000000001 0000000000000001
[    6.488514] f5e0: 0000000000000000 ffff800ffba16358 000000003c6bd6e4 0000000000000000
[    6.488517] f600: ffff0000093f1b57 00000000000f35ac 0000000000000000 ffff0000093e4b30
[    6.488520] f620: 35b1aca30dc27500 20303120676e6520 3020627568206e6f 0000000000000000
[    6.488523] f640: 0000000000000000 0000000000000000 ffff0000014976f8 ffff800fd3880000
[    6.488525] f660: ffff800fd3886e50 ffff800fd3883430 000000000000000a ffff0000014ea4d0
[    6.488528] f680: ffff000008d13c08 ffff800fd3880000 0000000000000001 0000000000000000
[    6.488531] f6a0: ffff800fd3883050 ffff000013e8f700 ffff00000132145c ffff000013e8f700
[    6.488534] f6c0: ffff0000087f3848 0000000020000005 ffff800fd3884cb8 000000000000000b
[    6.488537] f6e0: 0001000000000000 ffff800ffba16358 ffff000013e8f700 ffff0000087f3848
[    6.488543] [<ffff0000087f3848>] change_bit+0x18/0x2c
[    6.489145] [<ffff00000129131c>] amdgpu_device_ip_late_init+0x5c/0x150 [amdgpu]
[    6.489728] [<ffff000001293fd8>] amdgpu_device_init+0x12d0/0x1998 [amdgpu]
[    6.490307] [<ffff0000012976f0>] amdgpu_driver_load_kms+0x88/0x2f8 [amdgpu]
[    6.490396] [<ffff000000cb81a4>] drm_dev_register+0x154/0x1e0 [drm]
[    6.490994] [<ffff000001290640>] amdgpu_pci_probe+0xf8/0x200 [amdgpu]
[    6.491003] [<ffff000008447864>] local_pci_probe+0x48/0xb0
[    6.491007] [<ffff000008448cd0>] pci_device_probe+0x150/0x1b4
[    6.491014] [<ffff00000853d34c>] driver_probe_device+0x264/0x448
[    6.491018] [<ffff00000853d630>] __driver_attach+0x100/0x118
[    6.491021] [<ffff00000853ac50>] bus_for_each_dev+0x78/0xc8
[    6.491025] [<ffff00000853cb40>] driver_attach+0x30/0x38
[    6.491028] [<ffff00000853c488>] bus_add_driver+0x1f0/0x294
[    6.491032] [<ffff00000853e3a4>] driver_register+0x70/0x110
[    6.491035] [<ffff000008446e7c>] __pci_register_driver+0x68/0x74
[    6.491629] [<ffff00000163006c>] amdgpu_init+0x6c/0xd8 [amdgpu]
[    6.491636] [<ffff000008083790>] do_one_initcall+0x54/0x158
[    6.491642] [<ffff000008171bfc>] do_init_module+0x64/0x1e4
[    6.491646] [<ffff000008170800>] load_module+0x1ca4/0x20cc
[    6.491649] [<ffff000008170eb0>] SyS_finit_module+0xdc/0x108
[    6.491653] Exception stack(0xffff000013e8fec0 to 0xffff000013e90000)
[    6.491657] fec0: 0000000000000015 0000aaaacf1fcbe0 0000000000000000 0000000000000015
[    6.491660] fee0: 0000000000000000 000000000000007c 0000aaaacf1fa0a0 0000aaaacf1fa0a0
[    6.491663] ff00: 0000000000000111 0000fffff339c030 0000aaaacf1fa108 0000000000000006
[    6.491666] ff20: 0000000000000018 000000005c8a7711 0000aaaacf1e1079 0000ef714cbdd5b8
[    6.491668] ff40: 0000ffffa93f02e8 0000ffffa925a6d0 0000fffff339df70 0000aaaacf1fa0a0
[    6.491671] ff60: 0000aaaacf1fcbe0 0000aaaacf1f9540 0000000000000000 0000000000020000
[    6.491674] ff80: 0000aaaacf1f8bf0 0000000000000000 0000aaaacf1fcbef 0000000000000000
[    6.491677] ffa0: 0000aaaaae916af0 0000fffff339e1a0 0000ffffa93c8dbc 0000fffff339e1a0
[    6.491680] ffc0: 0000ffffa925a6f4 0000000060000000 0000000000000015 0000000000000111
[    6.491683] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    6.491686] [<ffff000008083544>] el0_svc_naked+0x38/0x3c
[    6.491692] Code: d2800022 8b400c21 f9800031 9ac32043 (c85f7c22) 
[    6.491727] ---[ end trace 5ea31d31904d511c ]---
[    6.491730] Kernel panic - not syncing: Fatal exception
[    6.491748] SMP: stopping secondary CPUs
[    6.494641] Kernel Offset: disabled
[    6.494645] CPU features: 0x1802008
[    6.494647] Memory Limit: none
[    7.047640] ---[ end Kernel panic - not syncing: Fatal exception
[    7.053674] ------------[ cut here ]------------
[    7.058290] WARNING: CPU: 11 PID: 275 at kernel/sched/core.c:1179 set_task_cpu+0x1b0/0x1c8
[    7.066541] Modules linked in: amdgpu(OE+) mmc_block amdchash(OE) i2c_algo_bit amd_sched(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops amdttm(OE) amdkcl(OE) drm i2c_core dw_mmc_pltfm(OE) dw_mmc(OE) tmfifo(OE) mmc_core virtio virtio_ring
[    7.089160] CPU: 11 PID: 275 Comm: systemd-udevd Tainted: G      D    OE  ------------   4.14.0-49.el7a.aarch64 #1
[    7.099495] task: ffff800fd33caa00 task.stack: ffff000013e80000
[    7.105404] PC is at set_task_cpu+0x1b0/0x1c8
[    7.109750] LR is at try_to_wake_up+0x170/0x454
[    7.114269] pc : [<ffff0000080fffe8>] lr : [<ffff000008100930>] pstate: 60000085
[    7.121652] sp : ffff0000097af9f0
[    7.124955] x29: ffff0000097af9f0 x28: 0000000000000000 
[    7.130258] x27: ffff800fd248e150 x26: ffff000008d13000 
[    7.135562] x25: ffff000008ccc000 x24: 0000000000000080 
[    7.140865] x23: 0000000000000004 x22: ffff000008d13000 
[    7.146168] x21: ffff800fd248e72c x20: 0000000000000000 
[    7.151471] x19: ffff800fd248dd00 x18: ffff0000014976f8 
[    7.156774] x17: 0000000000000001 x16: 0000000000000007 
[    7.162077] x15: 0000000000000000 x14: 000098967ff67698 
[    7.167380] x13: 000000000017d784 x12: 0000000000000007 
[    7.172683] x11: 7fffffffffffffff x10: 0000000000000002 
[    7.177986] x9 : 0000000300000000 x8 : 0000000000000000 
[    7.183289] x7 : 0000000000000000 x6 : 0000000000000010 
[    7.188592] x5 : 0000000000000000 x4 : 000000000000ffff 
[    7.193895] x3 : 0000000000000000 x2 : 0000000000000000 
[    7.199198] x1 : ffff000008d1b000 x0 : 0000000000000800 
[    7.204501] Call trace:
[    7.206938] Exception stack(0xffff0000097af8b0 to 0xffff0000097af9f0)
[    7.213367] f8a0:                                   0000000000000800 ffff000008d1b000
[    7.221186] f8c0: 0000000000000000 0000000000000000 000000000000ffff 0000000000000000
[    7.229004] f8e0: 0000000000000010 0000000000000000 0000000000000000 0000000300000000
[    7.236823] f900: 0000000000000002 7fffffffffffffff 0000000000000007 000000000017d784
[    7.244641] f920: 000098967ff67698 0000000000000000 0000000000000007 0000000000000001
[    7.252458] f940: ffff0000014976f8 ffff800fd248dd00 0000000000000000 ffff800fd248e72c
[    7.260276] f960: ffff000008d13000 0000000000000004 0000000000000080 ffff000008ccc000
[    7.268094] f980: ffff000008d13000 ffff800fd248e150 0000000000000000 ffff0000097af9f0
[    7.275912] f9a0: ffff000008100930 ffff0000097af9f0 ffff0000080fffe8 0000000060000085
[    7.283730] f9c0: ffff0000097afa20 ffff000008100ba4 0001000000000000 0000000000000001
[    7.291548] f9e0: ffff0000097af9f0 ffff0000080fffe8
[    7.296416] [<ffff0000080fffe8>] set_task_cpu+0x1b0/0x1c8
[    7.301804] [<ffff000008100930>] try_to_wake_up+0x170/0x454
[    7.307366] [<ffff000008100d18>] default_wake_function+0x30/0x3c
[    7.313365] [<ffff00000811caac>] __wake_up_common+0xa8/0x1a4
[    7.319015] [<ffff00000811cd0c>] __wake_up_locked+0x3c/0x48
[    7.324580] [<ffff0000082ecc90>] ep_poll_callback+0xbc/0x29c
[    7.330230] [<ffff00000811caac>] __wake_up_common+0xa8/0x1a4
[    7.335879] [<ffff00000811cc44>] __wake_up_common_lock+0x9c/0xe0
[    7.341875] [<ffff00000811ccc4>] __wake_up+0x3c/0x48
[    7.346831] [<ffff000008137b34>] wake_up_klogd_work_func+0x4c/0x68
[    7.353003] [<ffff0000081df390>] irq_work_run_list+0x78/0xa8
[    7.358651] [<ffff0000081df650>] irq_work_tick+0x48/0x60
[    7.363954] [<ffff000008154590>] update_process_times+0x44/0x5c
[    7.369865] [<ffff0000081655c0>] tick_sched_handle.isra.14+0x38/0x70
[    7.376209] [<ffff000008165640>] tick_sched_timer+0x48/0x88
[    7.381771] [<ffff000008154e78>] __hrtimer_run_queues+0x150/0x2d0
[    7.387853] [<ffff000008155860>] hrtimer_interrupt+0xa0/0x1d4
[    7.393593] [<ffff00000867517c>] arch_timer_handler_phys+0x3c/0x48
[    7.399762] [<ffff00000813e858>] handle_percpu_devid_irq+0x98/0x210
[    7.406019] [<ffff0000081389cc>] generic_handle_irq+0x34/0x4c
[    7.411754] [<ffff0000081390b8>] __handle_domain_irq+0x6c/0xc4
[    7.417576] [<ffff0000080816c8>] gic_handle_irq+0xa0/0x1b8
[    7.423050] Exception stack(0xffff000013e8f200 to 0xffff000013e8f340)
[    7.429479] f200: 0000000000000034 35b1aca30dc27500 35b1aca30dc27500 ffff800ffba12340
[    7.437297] f220: 0000000000000000 0000000000000001 ffff800ffba12338 3a676e69636e7973
[    7.445115] f240: ffff000008518d44 00000000000002c8 ffff000013e8efe0 ffff000013e8efe0
[    7.452933] f260: 35b1aca30dc27500 ffff000008a6cae8 3131303030303030 0000000000000000
[    7.460751] f280: 0000000000000007 0000000000000001 ffff0000014976f8 0000000000000000
[    7.468569] f2a0: ffff000009351000 0000000000000000 0000000000000000 ffff0000093515f8
[    7.476387] f2c0: ffff0000093515f8 ffff800fd3880000 0000000000000001 0000000000000000
[    7.484205] f2e0: ffff800fd33caa00 ffff000013e8f340 ffff0000080d28ac ffff000013e8f340
[    7.492024] f300: ffff0000080d28b0 0000000060000005 0000000000000007 ffff800ffba12340
[    7.499842] f320: 0001000000000000 0000000000000001 ffff000013e8f340 ffff0000080d28b0
[    7.507661] [<ffff000008082fb0>] el1_irq+0xb0/0x140
[    7.512531] [<ffff0000080d28b0>] panic+0x254/0x2a0
[    7.517313] [<ffff000008089268>] die+0x194/0x1a0
[    7.521921] [<ffff00000809bc34>] __do_kernel_fault+0xa8/0xfc
[    7.527569] [<ffff00000809bccc>] do_bad_area+0x44/0x98
[    7.532697] [<ffff00000809bd50>] do_alignment_fault+0x30/0x40
[    7.538432] [<ffff000008081334>] do_mem_abort+0x64/0xe4
[    7.543645] Exception stack(0xffff000013e8f5c0 to 0xffff000013e8f700)
[    7.550075] f5c0: 0000000000000000 ffff000013e8f774 0000000000000001 0000000000000001
[    7.557893] f5e0: 0000000000000000 ffff800ffba16358 000000003c6bd6e4 0000000000000000
[    7.565711] f600: ffff0000093f1b57 00000000000f35ac 0000000000000000 ffff0000093e4b30
[    7.573530] f620: 35b1aca30dc27500 20303120676e6520 3020627568206e6f 0000000000000000
[    7.581348] f640: 0000000000000000 0000000000000000 ffff0000014976f8 ffff800fd3880000
[    7.589166] f660: ffff800fd3886e50 ffff800fd3883430 000000000000000a ffff0000014ea4d0
[    7.596984] f680: ffff000008d13c08 ffff800fd3880000 0000000000000001 0000000000000000
[    7.604802] f6a0: ffff800fd3883050 ffff000013e8f700 ffff00000132145c ffff000013e8f700
[    7.612620] f6c0: ffff0000087f3848 0000000020000005 ffff800fd3884cb8 000000000000000b
[    7.620438] f6e0: 0001000000000000 ffff800ffba16358 ffff000013e8f700 ffff0000087f3848
[    7.628256] [<ffff000008082dd0>] el1_da+0x20/0x80
[    7.632951] [<ffff0000087f3848>] change_bit+0x18/0x2c
[    7.638595] [<ffff00000129131c>] amdgpu_device_ip_late_init+0x5c/0x150 [amdgpu]
[    7.646479] [<ffff000001293fd8>] amdgpu_device_init+0x12d0/0x1998 [amdgpu]
[    7.653922] [<ffff0000012976f0>] amdgpu_driver_load_kms+0x88/0x2f8 [amdgpu]
[    7.660958] [<ffff000000cb81a4>] drm_dev_register+0x154/0x1e0 [drm]
[    7.667806] [<ffff000001290640>] amdgpu_pci_probe+0xf8/0x200 [amdgpu]
[    7.674241] [<ffff000008447864>] local_pci_probe+0x48/0xb0
[    7.679716] [<ffff000008448cd0>] pci_device_probe+0x150/0x1b4
[    7.685454] [<ffff00000853d34c>] driver_probe_device+0x264/0x448
[    7.691450] [<ffff00000853d630>] __driver_attach+0x100/0x118
[    7.697098] [<ffff00000853ac50>] bus_for_each_dev+0x78/0xc8
[    7.702660] [<ffff00000853cb40>] driver_attach+0x30/0x38
[    7.707961] [<ffff00000853c488>] bus_add_driver+0x1f0/0x294
[    7.713523] [<ffff00000853e3a4>] driver_register+0x70/0x110
[    7.719085] [<ffff000008446e7c>] __pci_register_driver+0x68/0x74
[    7.725672] [<ffff00000163006c>] amdgpu_init+0x6c/0xd8 [amdgpu]
[    7.731584] [<ffff000008083790>] do_one_initcall+0x54/0x158
[    7.737147] [<ffff000008171bfc>] do_init_module+0x64/0x1e4
[    7.742621] [<ffff000008170800>] load_module+0x1ca4/0x20cc
[    7.748096] [<ffff000008170eb0>] SyS_finit_module+0xdc/0x108
[    7.753745] Exception stack(0xffff000013e8fec0 to 0xffff000013e90000)
[    7.760174] fec0: 0000000000000015 0000aaaacf1fcbe0 0000000000000000 0000000000000015
[    7.767993] fee0: 0000000000000000 000000000000007c 0000aaaacf1fa0a0 0000aaaacf1fa0a0
[    7.775811] ff00: 0000000000000111 0000fffff339c030 0000aaaacf1fa108 0000000000000006
[    7.783629] ff20: 0000000000000018 000000005c8a7711 0000aaaacf1e1079 0000ef714cbdd5b8
[    7.791448] ff40: 0000ffffa93f02e8 0000ffffa925a6d0 0000fffff339df70 0000aaaacf1fa0a0
[    7.799266] ff60: 0000aaaacf1fcbe0 0000aaaacf1f9540 0000000000000000 0000000000020000
[    7.807085] ff80: 0000aaaacf1f8bf0 0000000000000000 0000aaaacf1fcbef 0000000000000000
[    7.814903] ffa0: 0000aaaaae916af0 0000fffff339e1a0 0000ffffa93c8dbc 0000fffff339e1a0
[    7.822721] ffc0: 0000ffffa925a6f4 0000000060000000 0000000000000015 0000000000000111
[    7.830539] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    7.838357] [<ffff000008083544>] el0_svc_naked+0x38/0x3c
[    7.843658] ---[ end trace 5ea31d31904d511d ]---
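
For readers decoding the `Internal error: Oops: 96000021` line in the trace above, here is a small sketch that splits an AArch64 ESR value into the fields the kernel prints. The field positions follow the Arm architecture's ESR_ELx layout for data aborts. The struct and function names are made up for this example, and only the fault status codes relevant to this thread are named.

```c
#include <stdint.h>

/* Selected fields of an AArch64 ESR_ELx value for a data abort. */
struct esr_fields {
    unsigned ec;   /* exception class, bits [31:26] */
    unsigned il;   /* instruction length bit, bit [25] (1 = 32-bit) */
    unsigned wnr;  /* write-not-read, bit [6] */
    unsigned dfsc; /* data fault status code, bits [5:0] */
};

static struct esr_fields esr_decode(uint32_t esr)
{
    struct esr_fields f;

    f.ec = (esr >> 26) & 0x3f;
    f.il = (esr >> 25) & 0x1;
    f.wnr = (esr >> 6) & 0x1;
    f.dfsc = esr & 0x3f;
    return f;
}

/* Names only the status codes that appear in this thread. */
static const char *dfsc_name(unsigned dfsc)
{
    switch (dfsc) {
    case 0x04:
    case 0x05:
    case 0x06:
    case 0x07:
        return "translation fault"; /* low two bits give the table level */
    case 0x21:
        return "alignment fault";
    default:
        return "other (see the architecture manual)";
    }
}
```

For 0x96000021 this gives EC 0x25 (data abort taken at the current exception level), IL 1, WnR 0 and DFSC 0x21, an alignment fault, which matches the `do_alignment_fault` frame in the second trace. The `96000005` code reported later in the thread decodes to DFSC 0x05, a level 1 translation fault.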

We could try using an upstream kernel, but it might be easiest to just roll back to 2.0 again.

FinnStokes, Mar 14 '19 15:03

@FinnStokes The change_bit problem has been solved by this patch: https://lists.freedesktop.org/archives/amd-gfx/2019-March/032365.html
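
One possible reading of the earlier trace, offered as a sketch and not as a description of the patch: the fault was taken in `change_bit` through `do_alignment_fault`, with the faulting address in x1 equal to ffff000013e8f774. `change_bit` operates on an `unsigned long`, which is 8 bytes on aarch64, so the check below asks whether that address is a multiple of 8. The helper name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when addr is a multiple of width; width must be a power of two. */
static int is_naturally_aligned(uint64_t addr, size_t width)
{
    return (addr & (uint64_t)(width - 1)) == 0;
}
```

The address from the trace is aligned to 4 bytes but not to 8, which would fit a 32-bit variable being handed to a 64-bit atomic bit operation.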

mfk530, Mar 20 '19 05:03

@mfk530 Thanks for that information. I had not noticed your pull request #74 which showed up after I opened this issue. Applying the patch at https://lists.freedesktop.org/archives/amd-gfx/2019-March/032365.html seems to have resolved my issues, and the system is booting successfully again. I'll run some tests to see if the card is working properly.

FinnStokes, Mar 20 '19 16:03

clinfo ran fine, but running one of my OpenCL tests crashed the machine with

Message from syslogd at Mar 20 18:09:50 ...
 kernel:Internal error: Oops: 96000005 [#1] SMP

I'll look into this more tomorrow when I have physical access to reset the machine.

FinnStokes, Mar 20 '19 17:03

It looks like when I run any OpenCL code, I get a bunch of corrected PCIe errors, then the program hangs:

[  589.176607] pcieport 0000:00:00.0: AER: Corrected error received: id=0700
[  589.185842] pcieport 0000:07:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0700(Transmitter ID)
[  589.196513] pcieport 0000:07:00.0:   device [15b3:1975] error status/mask=00003000/00002000
[  589.204990] pcieport 0000:07:00.0:    [12] Replay Timer Timeout  
[  589.286599] pcieport 0000:00:00.0: AER: Corrected error received: id=0700
[  589.295878] pcieport 0000:07:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0700(Transmitter ID)
[  589.306560] pcieport 0000:07:00.0:   device [15b3:1975] error status/mask=00001000/00002000
[  589.315008] pcieport 0000:07:00.0:    [12] Replay Timer Timeout  
[  589.506593] pcieport 0000:00:00.0: AER: Corrected error received: id=0700
[  589.518191] pcieport 0000:07:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0700(Transmitter ID)
[  589.528845] pcieport 0000:07:00.0:   device [15b3:1975] error status/mask=00001000/00002000
[  589.537266] pcieport 0000:07:00.0:    [12] Replay Timer Timeout  
[  589.946601] pcieport 0000:00:00.0: AER: Multiple Corrected error received: id=0700
[  589.972027] pcieport 0000:07:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0700(Transmitter ID)
[  589.982731] pcieport 0000:07:00.0:   device [15b3:1975] error status/mask=00001000/00002000
[  589.991234] pcieport 0000:07:00.0:    [12] Replay Timer Timeout  
[  589.997441] pcieport 0000:07:00.0:   Error of this Agent(0700) is reported first
[  590.004963] pcieport 0000:08:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0800(Transmitter ID)
[  590.015667] pcieport 0000:08:00.0:   device [1022:1470] error status/mask=000030c0/00002000
[  590.024166] pcieport 0000:08:00.0:    [ 6] Bad TLP               
[  590.030365] pcieport 0000:08:00.0:    [ 7] Bad DLLP              
[  590.036577] pcieport 0000:08:00.0:    [12] Replay Timer Timeout  
[  590.276577] pcieport 0000:00:00.0: AER: Multiple Corrected error received: id=0700
[  590.296328] pcieport 0000:07:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0700(Transmitter ID)
[  590.307019] pcieport 0000:07:00.0:   device [15b3:1975] error status/mask=00001000/00002000
[  590.315492] pcieport 0000:07:00.0:    [12] Replay Timer Timeout  
[  590.321687] pcieport 0000:07:00.0:   Error of this Agent(0700) is reported first
[  590.329202] pcieport 0000:08:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0800(Receiver ID)
[  590.339626] pcieport 0000:08:00.0:   device [1022:1470] error status/mask=00000040/00002000
[  590.348112] pcieport 0000:08:00.0:    [ 6] Bad TLP               
[  590.443736] pcieport 0000:00:00.0: AER: Multiple Corrected error received: id=0700
[  590.463410] pcieport 0000:08:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0800(Transmitter ID)
[  590.474100] pcieport 0000:08:00.0:   device [1022:1470] error status/mask=00001040/00002000
[  590.482581] pcieport 0000:08:00.0:    [ 6] Bad TLP               
[  590.488779] pcieport 0000:08:00.0:    [12] Replay Timer Timeout  
[  590.836573] pcieport 0000:00:00.0: AER: Corrected error received: id=0700
[  590.843430] pcieport 0000:07:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0700(Transmitter ID)
[  590.853972] pcieport 0000:07:00.0:   device [15b3:1975] error status/mask=00001000/00002000
[  590.862339] pcieport 0000:07:00.0:    [12] Replay Timer Timeout  
[  590.946594] pcieport 0000:00:00.0: AER: Corrected error received: id=0700
[  590.953450] pcieport 0000:07:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0700(Transmitter ID)
[  590.964005] pcieport 0000:07:00.0:   device [15b3:1975] error status/mask=00001000/00002000
[  590.972379] pcieport 0000:07:00.0:    [12] Replay Timer Timeout
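
To make the `error status/mask=` pairs in the log above easier to read, here is a sketch that turns a correctable error status word into the bit names the kernel prints. The bit positions are those of the PCIe AER Correctable Error Status register. The table and function names are invented for this example, and the assumption that masked bits are skipped is inferred from the log lines, where status 00003000 with mask 00002000 lists only bit 12.

```c
#include <stddef.h>
#include <stdint.h>

struct aer_bit {
    unsigned bit;
    const char *name;
};

/* Correctable Error Status bit positions from the PCIe AER capability. */
static const struct aer_bit aer_cor_bits[] = {
    { 0,  "Receiver Error" },
    { 6,  "Bad TLP" },
    { 7,  "Bad DLLP" },
    { 8,  "REPLAY_NUM Rollover" },
    { 12, "Replay Timer Timeout" },
    { 13, "Advisory Non-Fatal" },
};

/*
 * Stores the names of bits that are set in status and clear in mask into
 * out (at most cap entries) and returns how many were stored.
 */
static size_t aer_decode_correctable(uint32_t status, uint32_t mask,
                                     const char **out, size_t cap)
{
    uint32_t visible = status & ~mask;
    size_t n = 0;
    size_t i;

    for (i = 0; i < sizeof(aer_cor_bits) / sizeof(aer_cor_bits[0]); i++) {
        if ((visible & (UINT32_C(1) << aer_cor_bits[i].bit)) && n < cap)
            out[n++] = aer_cor_bits[i].name;
    }
    return n;
}
```

Decoding 000030c0 with mask 00002000 yields Bad TLP, Bad DLLP and Replay Timer Timeout, the same three lines the kernel printed for device 0000:08:00.0. All of these are data link layer errors that the hardware corrected by replaying packets.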

Nothing more seems to happen until I attempt to kill the program, which leads to a kernel panic:

[ 1723.782124] qcm fence wait loop timeout expired
[ 1723.786655] The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[ 1723.795287] amdgpu 0000:0a:00.0: GPU reset begin!
[ 1723.952216] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[ 1723.960300] Mem abort info:
[ 1723.963102]   Exception class = DABT (current EL), IL = 32 bits
[ 1723.969011]   SET = 0, FnV = 0
[ 1723.972055]   EA = 0, S1PTW = 0
[ 1723.975195] Data abort info:
[ 1723.978066]   ISV = 0, ISS = 0x00000005
[ 1723.981891]   CM = 0, WnR = 0
[ 1723.984862] user pgtable: 64k pages, 48-bit VAs, pgd = ffff800f08b34a00
[ 1723.991467] [0000000000000000] *pgd=0000000000000000, *pud=0000000000000000
[ 1723.998435] Internal error: Oops: 96000005 [#1] SMP
[ 1724.003306] Modules linked in: vfat fat mlx5_ib ib_core mlx5_core mlxfw devlink ptp ipmi_ssif pps_core virtio_net ipmi_devintf ipmi_msghandler virtio_console uio_pdrv_genirq crc32_ce uio sbsa_gwdt ip_tables xfs libcrc32c mmc_block amdgpu(OE) amdchash(OE) i2c_algo_bit amd_sched(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops amdttm(OE) amdkcl(OE) drm i2c_core dw_mmc_pltfm(OE) dw_mmc(OE) tmfifo(OE) mmc_core virtio virtio_ring
[ 1724.042392] CPU: 5 PID: 118 Comm: kworker/5:1 Tainted: G           OE  ------------   4.14.0-49.el7a.aarch64 #1
[ 1724.053060] Workqueue: events kfd_process_hw_exception [amdgpu]
[ 1724.058971] task: ffff800fc8482200 task.stack: ffff000011860000
[ 1724.065464] PC is at soc15_baco_program_registers+0x74/0x1e0 [amdgpu]
[ 1724.072477] LR is at soc15_baco_program_registers+0x118/0x1e0 [amdgpu]
[ 1724.078995] pc : [<ffff0000013ebfac>] lr : [<ffff0000013ec050>] pstate: 80000005
[ 1724.086378] sp : ffff00001186fba0
[ 1724.089681] x29: ffff00001186fba0 x28: ffff0000014c5470 
[ 1724.094986] x27: ffff0000014c5724 x26: ffff800fd2688400 
[ 1724.100289] x25: ffff800fd3400000 x24: 0000000000000008 
[ 1724.105594] x23: 0000000000000008 x22: 0000000000000e2b 
[ 1724.110897] x21: ffff800fd3400000 x20: 0000000000000003 
[ 1724.116200] x19: ffff0000014c5694 x18: ffff0000334afb38 
[ 1724.121503] x17: 0000000000000000 x16: 0000000000000000 
[ 1724.126807] x15: 0000000000000007 x14: 0000000000000000 
[ 1724.132110] x13: 0000000020130307 x12: 0000000000000000 
[ 1724.137413] x11: 0000000000000000 x10: 0000000000000d00 
[ 1724.142717] x9 : ffff00001186fa10 x8 : ffff800fc8482f60 
[ 1724.148020] x7 : 0000000100022be3 x6 : 0000000000017f92 
[ 1724.153323] x5 : 0000003fee610554 x4 : 0000000000001be6 
[ 1724.158626] x3 : 0000000000000000 x2 : 0000000000000000 
[ 1724.163929] x1 : 0000000000000003 x0 : 0000000000000000 
[ 1724.169234] Process kworker/5:1 (pid: 118, stack limit = 0xffff000011860000)
[ 1724.176271] Call trace:
[ 1724.178708] Exception stack(0xffff00001186fa60 to 0xffff00001186fba0)
[ 1724.185138] fa60: 0000000000000000 0000000000000003 0000000000000000 0000000000000000
[ 1724.192957] fa80: 0000000000001be6 0000003fee610554 0000000000017f92 0000000100022be3
[ 1724.200776] faa0: ffff800fc8482f60 ffff00001186fa10 0000000000000d00 0000000000000000
[ 1724.208594] fac0: 0000000000000000 0000000020130307 0000000000000000 0000000000000007
[ 1724.216412] fae0: 0000000000000000 0000000000000000 ffff0000334afb38 ffff0000014c5694
[ 1724.224230] fb00: 0000000000000003 ffff800fd3400000 0000000000000e2b 0000000000000008
[ 1724.232048] fb20: 0000000000000008 ffff800fd3400000 ffff800fd2688400 ffff0000014c5724
[ 1724.239866] fb40: ffff0000014c5470 ffff00001186fba0 ffff0000013ec050 ffff00001186fba0
[ 1724.247685] fb60: ffff0000013ebfac 0000000080000005 ffff0000014c5670 0000000000000003
[ 1724.255504] fb80: ffffffffffffffff 0000000000000e2b ffff00001186fba0 ffff0000013ebfac
[ 1724.263898] [<ffff0000013ebfac>] soc15_baco_program_registers+0x74/0x1e0 [amdgpu]
[ 1724.271942] [<ffff0000013ec324>] vega10_baco_set_state+0xfc/0x120 [amdgpu]
[ 1724.279377] [<ffff0000013ed93c>] pp_set_asic_baco_state+0x5c/0x80 [amdgpu]
[ 1724.286803] [<ffff0000013178d8>] soc15_asic_reset+0x80/0x188 [amdgpu]
[ 1724.293793] [<ffff00000129579c>] amdgpu_device_gpu_recover+0x48c/0x8e0 [amdgpu]
[ 1724.301655] [<ffff000001376294>] amdgpu_amdkfd_gpu_reset+0x34/0x40 [amdgpu]
[ 1724.309174] [<ffff00000138a4cc>] kfd_process_hw_exception+0x24/0x30 [amdgpu]
[ 1724.316226] [<ffff0000080ed444>] process_one_work+0x16c/0x380
[ 1724.321963] [<ffff0000080ed6b8>] worker_thread+0x60/0x40c
[ 1724.327355] [<ffff0000080f42e0>] kthread+0x10c/0x138
[ 1724.332312] [<ffff000008084af8>] ret_from_fork+0x10/0x18
[ 1724.337616] Code: 9129f000 f8607b20 b9400e62 b9400261 (b8627816) 
[ 1724.343731] ---[ end trace f6bfdaa6585aa085 ]---
[ 1724.348338] Kernel panic - not syncing: Fatal exception
[ 1724.353561] SMP: stopping secondary CPUs
[ 1724.357489] Kernel Offset: disabled
[ 1724.360969] CPU features: 0x1802008
[ 1724.364447] Memory Limit: none
[ 1724.367501] ---[ end Kernel panic - not syncing: Fatal exception
[ 1724.373531] ------------[ cut here ]------------
[ 1724.378152] WARNING: CPU: 5 PID: 118 at kernel/sched/core.c:1179 set_task_cpu+0x1b0/0x1c8
[ 1724.386317] Modules linked in: vfat fat mlx5_ib ib_core mlx5_core mlxfw devlink ptp ipmi_ssif pps_core virtio_net ipmi_devintf ipmi_msghandler virtio_console uio_pdrv_genirq crc32_ce uio sbsa_gwdt ip_tables xfs libcrc32c mmc_block amdgpu(OE) amdchash(OE) i2c_algo_bit amd_sched(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops amdttm(OE) amdkcl(OE) drm i2c_core dw_mmc_pltfm(OE) dw_mmc(OE) tmfifo(OE) mmc_core virtio virtio_ring
[ 1724.425379] CPU: 5 PID: 118 Comm: kworker/5:1 Tainted: G      D    OE  ------------   4.14.0-49.el7a.aarch64 #1
[ 1724.436035] Workqueue: events kfd_process_hw_exception [amdgpu]
[ 1724.441947] task: ffff800fc8482200 task.stack: ffff000011860000
[ 1724.447858] PC is at set_task_cpu+0x1b0/0x1c8
[ 1724.452204] LR is at try_to_wake_up+0x170/0x454
[ 1724.456723] pc : [<ffff0000080fffe8>] lr : [<ffff000008100930>] pstate: 60000085
[ 1724.464106] sp : ffff0000096ef9f0
[ 1724.467409] x29: ffff0000096ef9f0 x28: 0000000000000000 
[ 1724.472713] x27: ffff800fd19c0450 x26: ffff000008d13000 
[ 1724.478016] x25: ffff000008ccc000 x24: 0000000000000080 
[ 1724.483319] x23: 0000000000000004 x22: ffff000008d13000 
[ 1724.488622] x21: ffff800fd19c0a2c x20: 0000000000000000 
[ 1724.493925] x19: ffff800fd19c0000 x18: ffff0000334afb38 
[ 1724.499228] x17: 0000000000000001 x16: 0000000000000000 
[ 1724.504531] x15: 0000000000000007 x14: 0000000000000000 
[ 1724.509834] x13: ffff000008a6cae8 x12: 35b1aca30dc27500 
[ 1724.515137] x11: ffff00001186f440 x10: ffff00001186f440 
[ 1724.520440] x9 : 000000000000035b x8 : ffff000008518d44 
[ 1724.525743] x7 : 0000000000000000 x6 : 0000000000000010 
[ 1724.531046] x5 : 0000000000000000 x4 : 000000000000ffff 
[ 1724.536349] x3 : 0000000000000000 x2 : 0000000000000000 
[ 1724.541652] x1 : ffff000008d1b000 x0 : 0000000000000020 
[ 1724.546955] Call trace:
[ 1724.549392] Exception stack(0xffff0000096ef8b0 to 0xffff0000096ef9f0)
[ 1724.555822] f8a0:                                   0000000000000020 ffff000008d1b000
[ 1724.563640] f8c0: 0000000000000000 0000000000000000 000000000000ffff 0000000000000000
[ 1724.571459] f8e0: 0000000000000010 0000000000000000 ffff000008518d44 000000000000035b
[ 1724.579277] f900: ffff00001186f440 ffff00001186f440 35b1aca30dc27500 ffff000008a6cae8
[ 1724.587095] f920: 0000000000000000 0000000000000007 0000000000000000 0000000000000001
[ 1724.594914] f940: ffff0000334afb38 ffff800fd19c0000 0000000000000000 ffff800fd19c0a2c
[ 1724.602732] f960: ffff000008d13000 0000000000000004 0000000000000080 ffff000008ccc000
[ 1724.610551] f980: ffff000008d13000 ffff800fd19c0450 0000000000000000 ffff0000096ef9f0
[ 1724.618369] f9a0: ffff000008100930 ffff0000096ef9f0 ffff0000080fffe8 0000000060000085
[ 1724.626188] f9c0: ffff0000096efa20 ffff000008100ba4 0001000000000000 000000000000000a
[ 1724.634005] f9e0: ffff0000096ef9f0 ffff0000080fffe8
[ 1724.638873] [<ffff0000080fffe8>] set_task_cpu+0x1b0/0x1c8
[ 1724.644262] [<ffff000008100930>] try_to_wake_up+0x170/0x454
[ 1724.649823] [<ffff000008100d18>] default_wake_function+0x30/0x3c
[ 1724.655822] [<ffff00000811caac>] __wake_up_common+0xa8/0x1a4
[ 1724.661471] [<ffff00000811cd0c>] __wake_up_locked+0x3c/0x48
[ 1724.667036] [<ffff0000082ecc90>] ep_poll_callback+0xbc/0x29c
[ 1724.672685] [<ffff00000811caac>] __wake_up_common+0xa8/0x1a4
[ 1724.678334] [<ffff00000811cc44>] __wake_up_common_lock+0x9c/0xe0
[ 1724.684330] [<ffff00000811ccc4>] __wake_up+0x3c/0x48
[ 1724.689286] [<ffff000008137b34>] wake_up_klogd_work_func+0x4c/0x68
[ 1724.695458] [<ffff0000081df390>] irq_work_run_list+0x78/0xa8
[ 1724.701106] [<ffff0000081df650>] irq_work_tick+0x48/0x60
[ 1724.706409] [<ffff000008154590>] update_process_times+0x44/0x5c
[ 1724.712320] [<ffff0000081655c0>] tick_sched_handle.isra.14+0x38/0x70
[ 1724.718663] [<ffff000008165640>] tick_sched_timer+0x48/0x88
[ 1724.724225] [<ffff000008154e78>] __hrtimer_run_queues+0x150/0x2d0
[ 1724.730307] [<ffff000008155860>] hrtimer_interrupt+0xa0/0x1d4
[ 1724.736046] [<ffff00000867517c>] arch_timer_handler_phys+0x3c/0x48
[ 1724.742216] [<ffff00000813e858>] handle_percpu_devid_irq+0x98/0x210
[ 1724.748473] [<ffff0000081389cc>] generic_handle_irq+0x34/0x4c
[ 1724.754209] [<ffff0000081390b8>] __handle_domain_irq+0x6c/0xc4
[ 1724.760031] [<ffff0000080816c8>] gic_handle_irq+0xa0/0x1b8
[ 1724.765505] Exception stack(0xffff00001186f660 to 0xffff00001186f7a0)
[ 1724.771935] f660: 0000000000000034 35b1aca30dc27500 35b1aca30dc27500 ffff800ffb8f2340
[ 1724.779753] f680: 0000000000000000 0000000000000001 ffff800ffb8f2338 65206c6174614620
[ 1724.787571] f6a0: ffff000008518d44 000000000000035b ffff00001186f440 ffff00001186f440
[ 1724.795389] f6c0: 35b1aca30dc27500 ffff000008a6cae8 0000000000000000 0000000000000007
[ 1724.803207] f6e0: 0000000000000007 0000000000000001 ffff0000334afb38 0000000000000000
[ 1724.811025] f700: ffff000009351000 0000000000000000 0000000000000000 ffff0000093515f8
[ 1724.818844] f720: ffff0000093515f8 ffff800fd3400000 ffff800fd2688400 ffff0000014c5724
[ 1724.826662] f740: ffff800fc8482200 ffff00001186f7a0 ffff0000080d28ac ffff00001186f7a0
[ 1724.834481] f760: ffff0000080d28b0 0000000060000005 ffffffffffffffff 0000000000000000
[ 1724.842299] f780: 0001000000000000 ffff800ffb8f6358 ffff00001186f7a0 ffff0000080d28b0
[ 1724.850117] [<ffff000008082fb0>] el1_irq+0xb0/0x140
[ 1724.854986] [<ffff0000080d28b0>] panic+0x254/0x2a0
[ 1724.859767] [<ffff000008089268>] die+0x194/0x1a0
[ 1724.864376] [<ffff00000809bc34>] __do_kernel_fault+0xa8/0xfc
[ 1724.870029] [<ffff000008812910>] do_page_fault+0x204/0x3cc
[ 1724.875504] [<ffff000008812b28>] do_translation_fault+0x50/0x5c
[ 1724.881413] [<ffff000008081334>] do_mem_abort+0x64/0xe4
[ 1724.886626] Exception stack(0xffff00001186fa60 to 0xffff00001186fba0)
[ 1724.893055] fa60: 0000000000000000 0000000000000003 0000000000000000 0000000000000000
[ 1724.900873] fa80: 0000000000001be6 0000003fee610554 0000000000017f92 0000000100022be3
[ 1724.908692] faa0: ffff800fc8482f60 ffff00001186fa10 0000000000000d00 0000000000000000
[ 1724.916510] fac0: 0000000000000000 0000000020130307 0000000000000000 0000000000000007
[ 1724.924328] fae0: 0000000000000000 0000000000000000 ffff0000334afb38 ffff0000014c5694
[ 1724.932146] fb00: 0000000000000003 ffff800fd3400000 0000000000000e2b 0000000000000008
[ 1724.939964] fb20: 0000000000000008 ffff800fd3400000 ffff800fd2688400 ffff0000014c5724
[ 1724.947782] fb40: ffff0000014c5470 ffff00001186fba0 ffff0000013ec050 ffff00001186fba0
[ 1724.955600] fb60: ffff0000013ebfac 0000000080000005 ffff0000014c5670 0000000000000003
[ 1724.963419] fb80: ffffffffffffffff 0000000000000e2b ffff00001186fba0 ffff0000013ebfac
[ 1724.971236] [<ffff000008082dd0>] el1_da+0x20/0x80
[ 1724.976512] [<ffff0000013ebfac>] soc15_baco_program_registers+0x74/0x1e0 [amdgpu]
[ 1724.984558] [<ffff0000013ec324>] vega10_baco_set_state+0xfc/0x120 [amdgpu]
[ 1724.991996] [<ffff0000013ed93c>] pp_set_asic_baco_state+0x5c/0x80 [amdgpu]
[ 1724.999423] [<ffff0000013178d8>] soc15_asic_reset+0x80/0x188 [amdgpu]
[ 1725.006406] [<ffff00000129579c>] amdgpu_device_gpu_recover+0x48c/0x8e0 [amdgpu]
[ 1725.014272] [<ffff000001376294>] amdgpu_amdkfd_gpu_reset+0x34/0x40 [amdgpu]
[ 1725.021794] [<ffff00000138a4cc>] kfd_process_hw_exception+0x24/0x30 [amdgpu]
[ 1725.028835] [<ffff0000080ed444>] process_one_work+0x16c/0x380
[ 1725.034571] [<ffff0000080ed6b8>] worker_thread+0x60/0x40c
[ 1725.039960] [<ffff0000080f42e0>] kthread+0x10c/0x138
[ 1725.044915] [<ffff000008084af8>] ret_from_fork+0x10/0x18
[ 1725.050215] ---[ end trace f6bfdaa6585aa086 ]---
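
For anyone reading the trace above, the number in "Internal error: Oops: 96000005" is the arm64 exception syndrome (ESR) value, and it can be split into the same fields the kernel prints under "Mem abort info". Below is a minimal sketch of that decoding, assuming the standard ESR_EL1 layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0). The function name `decode_esr` and the small lookup tables are only illustrative and cover just the values seen in this log.

```python
# Hypothetical helper: split an arm64 ESR value into its main fields.
# Only the exception class and fault codes relevant to this log are named.

EXCEPTION_CLASSES = {
    0x25: "DABT (current EL)",  # data abort taken without a change in exception level
}

# Data fault status codes 0b0001LL are translation faults at level LL.
TRANSLATION_FAULTS = {
    0x04: "level 0 translation fault",
    0x05: "level 1 translation fault",
    0x06: "level 2 translation fault",
    0x07: "level 3 translation fault",
}


def decode_esr(esr: int) -> dict:
    """Return the EC, IL, ISS, DFSC and WnR fields of a data-abort ESR value."""
    ec = (esr >> 26) & 0x3F   # exception class, bits 31:26
    il = (esr >> 25) & 0x1    # instruction length bit (1 = 32-bit instruction)
    iss = esr & 0x1FFFFFF     # instruction specific syndrome, bits 24:0
    dfsc = iss & 0x3F         # data fault status code, ISS bits 5:0
    wnr = (iss >> 6) & 0x1    # 1 = fault on a write, 0 = fault on a read
    return {
        "ec": ec,
        "ec_name": EXCEPTION_CLASSES.get(ec, "unknown"),
        "il": il,
        "iss": iss,
        "dfsc": dfsc,
        "dfsc_name": TRANSLATION_FAULTS.get(dfsc, "other"),
        "wnr": wnr,
    }


if __name__ == "__main__":
    for name, value in decode_esr(0x96000005).items():
        print(f"{name}: {value}")
```

For 0x96000005 this gives a data abort at the current EL with ISS = 0x5 and WnR = 0, i.e. a read that hit a translation fault. That agrees with the fields printed in the log and with the NULL pointer read at soc15_baco_program_registers+0x74.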

On one test I got the following memory access fault, but I usually don't see it:

Memory access fault by GPU node-1 (Agent handle: 0x36bcf4b0) on address 0x3fffff0000. Reason: Page not present or supervisor privilege.
Nearby memory map:
0x3ae00000, 0x200000, System
0x3b100000, 0x100000, System
0x4129280000, 0x81000, System

PtrInfo:
	Address: 0x3ae00000-0x3b000000/0x3ae00000-0x3b000000
	Size: 0x200000
	Type: 1
	Owner: 0x36bce9c0
	CanAccess: 1
		0x36bcf4b0
	In block: 0x3ae00000, 0x200000
PtrInfo:
	Address: 0x3b100000-0x3b200000/0x3b100000-0x3b200000
	Size: 0x100000
	Type: 1
	Owner: 0x36bce9c0
	CanAccess: 1
		0x36bcf4b0
	In block: 0x3b100000, 0x100000
PtrInfo:
	Address: 0x4129280000-0x4129301000/0x4129280000-0x4129301000
	Size: 0x81000
	Type: 1
	Owner: 0x36bce9c0
	CanAccess: 1
		0x36bcf4b0
	In block: 0x4129280000, 0x90000

Message from syslogd at Mar 21 15:37:12 ...
 kernel:Internal error: Oops: 96000007 [#1] SMP

FinnStokes avatar Mar 21 '19 15:03 FinnStokes

It looks more like a HW problem than a SW problem at this point. With all the GPUs enabled, you're getting various PCIe errors and what looks like memory corruption.

Make sure your power supply can handle all the GPUs.
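
To get a feel for how often each device reports PCIe errors, the messages can be tallied from a saved dmesg dump. The following is only a rough sketch: the regular expression is based on the single "PCIe Bus Error" line format quoted earlier in this issue, and the function name `count_pcie_errors` and the file name `dmesg.txt` are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# [  590.953450] pcieport 0000:07:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0700(Transmitter ID)
# Assumption: other AER messages follow the same "severity=..., type=..." layout.
PCIE_ERROR = re.compile(
    r"(?P<device>\S+): PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<type>[^,]+)"
)


def count_pcie_errors(lines):
    """Count PCIe bus error reports per (device, severity, type)."""
    counts = Counter()
    for line in lines:
        match = PCIE_ERROR.search(line)
        if match:
            key = (
                match.group("device"),
                match.group("severity").strip(),
                match.group("type").strip(),
            )
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # "dmesg.txt" is a hypothetical file, e.g. produced with `dmesg > dmesg.txt`.
    with open("dmesg.txt", encoding="utf-8", errors="replace") as log:
        for (device, severity, kind), n in count_pcie_errors(log).most_common():
            print(f"{device}  {severity}  {kind}  x{n}")
```

If the counts climb when more GPUs are enabled or under load, that would be consistent with a power or link problem rather than a driver bug.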

fxkamd avatar Apr 16 '19 19:04 fxkamd

Closing off after 4 years of inactivity.

kentrussell avatar Nov 10 '23 16:11 kentrussell