Building Stage 3 causes system crash

Open Jan200101 opened this issue 2 years ago • 10 comments

Zig Version

e7d28344fa3ee81d6ad7ca5ce1f83d50d8502118

Steps to Reproduce and Observed Behavior

Unsure what causes this

Git bisect points to #13560 as the cause. I know too little about wasm to investigate where exactly the issue lies.

Building Stage 3 at e7d28344fa3ee81d6ad7ca5ce1f83d50d8502118 or later causes anything from a segmentation fault to a full kernel panic.

System logs show nothing useful beyond zig2 terminating and some unrelated system components failing hard (WiFi, IME, etc.).

lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   39 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Vendor ID:                       GenuineIntel
Model name:                      11th Gen Intel(R) Core(TM) i5-11300H @ 3.10GHz
CPU family:                      6
Model:                           140
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
Stepping:                        1
CPU(s) scaling MHz:              23%
CPU max MHz:                     4400.0000
CPU min MHz:                     400.0000
BogoMIPS:                        6220.80
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear ibt flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       192 KiB (4 instances)
L1i cache:                       128 KiB (4 instances)
L2 cache:                        5 MiB (4 instances)
L3 cache:                        8 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Expected Behavior

Build Zig post e7d28344fa3ee81d6ad7ca5ce1f83d50d8502118 successfully

Jan200101 avatar Feb 01 '23 11:02 Jan200101

While reproducing this again in an attempt to trigger a SIGSEGV, I found that build/compiler_rt.c produces multiple warnings about conflicting types for built-ins.

Could this be related?

Jan200101 avatar Feb 01 '23 11:02 Jan200101

SIGSEGV backtrace
#0  0x0000000000a7d206 in multi_array_list_MultiArrayList_28zig_Ast_TokenList__struct_18592_29_set (a0=0x7fffe6818c00, a1=0x50204, a2=...) at /home/sentry/git/zig/build/zig2.c:599018
#1  0x000000000085f3e1 in multi_array_list_MultiArrayList_28zig_Ast_TokenList__struct_18592_29_appendAssumeCapacity (a0=0x7fffe6818c00, a1=...) at /home/sentry/git/zig/build/zig2.c:404832
#2  0x000000000062af3a in multi_array_list_MultiArrayList_28zig_Ast_TokenList__struct_18592_29_append (a0=0x7fffe6818c00, a1=..., a2=...) at /home/sentry/git/zig/build/zig2.c:234708
#3  0x000000000077d343 in zig_parse_parse (a0=..., a1=...) at /home/sentry/git/zig/build/zig2.c:325166
#4  0x000000000087c2e2 in Module_astGenFile (a0=0x470af58, a1=0x7fffc48d22e0) at /home/sentry/git/zig/build/zig2.c:416759
#5  0x0000000000881d42 in Compilation_workerAstGenFile (a0=0x4e5c798, a1=0x7fffc48d22e0, a2=0x7fffffff02d0, a3=0x4e5cca0, a4=...) at /home/sentry/git/zig/build/zig2.c:418287
#6  0x0000000000a8ced2 in ThreadPool_spawn__anon_88746_Closure_runFn (a0=0x7fffd4976cf8) at /home/sentry/git/zig/build/zig2.c:605558
#7  0x00000000005ce90c in ThreadPool_worker (a0=0x7fffffffb180) at /home/sentry/git/zig/build/zig2.c:210184
#8  0x0000000000a3a7d7 in Thread_callFn__anon_85763 (a0=...) at /home/sentry/git/zig/build/zig2.c:567336
#9  0x00000000007e786d in Thread_PosixThreadImpl_spawn__anon_65683_Instance_entryFn (a0=0x282a3e0) at /home/sentry/git/zig/build/zig2.c:361576
#10 0x00007fffec8ae12d in start_thread () from /lib64/libc.so.6
#11 0x00007fffec92fbc0 in clone3 () from /lib64/libc.so.6

Unsure whether this is the same thing that causes the kernel crashes.

Jan200101 avatar Feb 01 '23 12:02 Jan200101

@Jan200101 Can you specify your exact operating system / Linux distribution and version? Maybe also the version of your installed system libc. (I don't know how to look that up myself, though the output of zig libc (from a working Zig version) is probably valuable.) I think both of those components might be related to the kernel panic at least.
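One way to collect the details requested above, as a hedged sketch rather than an official checklist: `uname` is POSIX, while `/etc/os-release`, `getconf GNU_LIBC_VERSION` and `sw_vers` are each only present on some systems, so every step is guarded and simply skipped where it does not apply.

```shell
# Kernel name, release and machine architecture.
uname -srm

# Distribution name and version on systems that ship /etc/os-release.
# Sourced in a subshell so its variables do not leak into the session.
if [ -r /etc/os-release ]; then
  ( . /etc/os-release; echo "distro: ${PRETTY_NAME:-unknown}" )
fi

# glibc version; GNU_LIBC_VERSION is a glibc-specific getconf variable,
# so this prints nothing on musl or macOS.
if libc=$(getconf GNU_LIBC_VERSION 2>/dev/null); then
  echo "libc: $libc"
fi

# macOS has no /etc/os-release; sw_vers reports the OS version there.
if command -v sw_vers >/dev/null 2>&1; then
  sw_vers
fi
```

On package-based distributions the package manager can give the exact build as well, for example `rpm -q glibc` on Fedora.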

rohlem avatar Feb 01 '23 17:02 rohlem

Sure thing.

I am using the KDE spin of Fedora Linux 37 running on x86_64. The exact glibc package I am using is glibc-2.36-9.fc37.x86_64, the sources for which are available here.

The output of zig libc is:

# The directory that contains `stdlib.h`.
# On POSIX-like systems, include directories be found with: `cc -E -Wp,-v -xc /dev/null`
include_dir=/usr/include

# The system-specific include directory. May be the same as `include_dir`.
# On Windows it's the directory that includes `vcruntime.h`.
# On POSIX it's the directory that includes `sys/errno.h`.
sys_include_dir=/usr/include

# The directory that contains `crt1.o` or `crt2.o`.
# On POSIX, can be found with `cc -print-file-name=crt1.o`.
# Not needed when targeting MacOS.
crt_dir=/usr/lib/gcc/x86_64-redhat-linux/12/../../../../lib64

# The directory that contains `vcruntime.lib`.
# Only needed when targeting MSVC on Windows.
msvc_lib_dir=

# The directory that contains `kernel32.lib`.
# Only needed when targeting MSVC on Windows.
kernel32_lib_dir=

# The directory that contains `crtbeginS.o` and `crtendS.o`
# Only needed when targeting Haiku.
gcc_dir=

For completeness' sake, here is the output of zig env:

{
 "zig_exe": "/usr/bin/zig",
 "lib_dir": "/usr/lib/zig",
 "std_dir": "/usr/lib/zig/std",
 "global_cache_dir": "/home/sentry/.cache/zig",
 "version": "0.10.0",
 "target": "x86_64-linux.6.1.8...6.1.8-gnu.2.36"
}

This zig build is from https://copr.fedorainfracloud.org/coprs/sentry/zig/ (mine)

Jan200101 avatar Feb 01 '23 19:02 Jan200101

Same thing is happening to me with 0.11.0-dev.1796+c9e02d3e6 built from this repo and from zig-bootstrap.

I'm running Arch Linux x86_64 with kernel 6.2.1-arch-1 on my Framework laptop with an i5-1135G7.

The output of zig libc is

# The directory that contains `stdlib.h`.
# On POSIX-like systems, include directories be found with: `cc -E -Wp,-v -xc /dev/null`
include_dir=/usr/include

# The system-specific include directory. May be the same as `include_dir`.
# On Windows it's the directory that includes `vcruntime.h`.
# On POSIX it's the directory that includes `sys/errno.h`.
sys_include_dir=/usr/include

# The directory that contains `crt1.o` or `crt2.o`.
# On POSIX, can be found with `cc -print-file-name=crt1.o`.
# Not needed when targeting MacOS.
crt_dir=/usr/lib/gcc/x86_64-pc-linux-gnu/12.2.1/../../../../lib

# The directory that contains `vcruntime.lib`.
# Only needed when targeting MSVC on Windows.
msvc_lib_dir=

# The directory that contains `kernel32.lib`.
# Only needed when targeting MSVC on Windows.
kernel32_lib_dir=

# The directory that contains `crtbeginS.o` and `crtendS.o`
# Only needed when targeting Haiku.
gcc_dir=

I'm using glibc-2.37-2 from arch's Core repo. I also built zig from https://github.com/ziglang/zig-bootstrap using ./build native-linux-gnu native

mov-rax avatar Mar 03 '23 20:03 mov-rax

Calling this a segfault is incomplete: it reliably causes a system crash.

Jan200101 avatar Mar 04 '23 14:03 Jan200101

If the system crashes that is a kernel bug. Please file a Linux bug report.

andrewrk avatar Mar 04 '23 19:03 andrewrk

Reporting it to the Linux developers would be best, but it's unclear whether it's limited to Linux or at what layer it even occurs (software, firmware, hardware).

There isn't a lot to report to the Linux developers to act upon. The best I can tell them is that the output of a wasm program converted to C which includes compiler-rt causes my CPU to lock up.

Zig is the only program with which I have (thus far) managed to reproduce it, which is why it's reported here.

Jan200101 avatar Mar 04 '23 20:03 Jan200101

does this reproduce if you git clone and build from source? you mention above this is using fedora sentry

nektro avatar Mar 04 '23 21:03 nektro

> does this reproduce if you git clone and build from source? you mention above this is using fedora sentry

yes, and this has been the only way I have been able to reproduce this.

With the info mov-rax provided, it appears to affect 11th gen mobile i5s, but the full scope is unclear.

Jan200101 avatar Mar 04 '23 21:03 Jan200101

I ran into the same situation while compiling the master branch, or running brew install zig --HEAD, on macOS (x86). zig2 compiles without problems, but during stage3 the kernel panicked and the machine rebooted.

EndlessArch avatar Mar 29 '23 08:03 EndlessArch

> I ran into the same situation while compiling the master branch, or running brew install zig --HEAD, on macOS (x86). zig2 compiles without problems, but during stage3 the kernel panicked and the machine rebooted.

Can you provide some information about the system in question? Most notably the CPU model and macOS version.

Jan200101 avatar Mar 30 '23 22:03 Jan200101

> > I ran into the same situation while compiling the master branch, or running brew install zig --HEAD, on macOS (x86). zig2 compiles without problems, but during stage3 the kernel panicked and the machine rebooted.

> Can you provide some information about the system in question? Most notably the CPU model and macOS version.

CPU: Intel(R) Core(TM) i3-1000NG4 CPU @ 1.10GHz
OS: macOS Ventura 13.3

Also posting the error message that appeared after the reboot, just in case:

panic(cpu 0 caller 0xffffff800a3dd585): port 0xffffff800a430156: invalid kobject type, got 880 wanted 1 @ipc_kobject.c:688
Panicked task 0xffffffa94dc9de18: 6 threads: pid 1705: zig2
Backtrace (CPU 0), panicked thread: 0xffffff9aeae89b30, Frame : Return Address
0xffffffb484066610 : 0xffffff8009c705fd 
0xffffffb484066660 : 0xffffff8009dc4b84 
0xffffffb4840666a0 : 0xffffff8009db4619 
0xffffffb484066700 : 0xffffff8009c10951 
0xffffffb484066720 : 0xffffff8009c708dd 
0xffffffb484066810 : 0xffffff8009c6ff87 
0xffffffb484066870 : 0xffffff800a3dd09b 
0xffffffb484066960 : 0xffffff800a3dd585 
0xffffffb484066970 : 0xffffff8009c76481 
0xffffffb4840669a0 : 0xffffff8009c79ee4 
0xffffffb4840669e0 : 0xffffff8009c71a55 
0xffffffb484067ec0 : 0xffffff8009c7233b 
0xffffffb484067f10 : 0xffffff8009db5032 
0xffffffb484067fa0 : 0xffffff8009c1087f 

Process name corresponding to current thread (0xffffff9aeae89b30): zig2

Mac OS version:
22E252

Kernel version:
Darwin Kernel Version 22.4.0: Mon Mar  6 21:00:17 PST 2023; root:xnu-8796.101.5~3/RELEASE_X86_64
Kernel UUID: CF2A42DA-3F7C-30C6-9433-6F2076FF1F94
roots installed: 0
KernelCache slide: 0x0000000009800000
KernelCache base:  0xffffff8009a00000
Kernel slide:      0x00000000098dc000
Kernel text base:  0xffffff8009adc000
__HIB  text base: 0xffffff8009900000
System model name: MacBookAir9,1 (Mac-0CFF9C7C2B63DF8D)
System shutdown begun: NO
Hibernation exit count: 0

EndlessArch avatar Apr 01 '23 15:04 EndlessArch

This is out of scope of Zig. Please contact the kernel development team and read the kernel documentation for how to report a bug.

andrewrk avatar Apr 10 '23 19:04 andrewrk

> This is out of scope of Zig. Please contact the kernel development team and read the kernel documentation for how to report a bug.

Where should bug reports for macOS be sent?

Jan200101 avatar Apr 10 '23 19:04 Jan200101

> > > I ran into the same situation while compiling the master branch, or running brew install zig --HEAD, on macOS (x86). zig2 compiles without problems, but during stage3 the kernel panicked and the machine rebooted.

> > Can you provide some information about the system in question? Most notably the CPU model and macOS version.

> CPU: Intel(R) Core(TM) i3-1000NG4 CPU @ 1.10GHz OS: macOS Ventura 13.3

> Also posting the error message that appeared after the reboot, just in case

is that an 8GB MacBook Air with 1.5GB shared VRAM? maxrss is approx 5.2GB on my host to build stage3. I wonder how much memory is available after desktop login. (still, the kernel shouldn't be crashing).
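To compare against the roughly 5.2GB peak quoted above, the build's own peak memory use can be measured with `time(1)`. The sketch below is an assumption-laden helper, not part of the Zig build: the function names `time_flag` and `peak_rss` are made up for illustration, and it assumes a standalone `/usr/bin/time` binary is installed (GNU time reports "Maximum resident set size" with `-v`, BSD and macOS time with `-l`).

```shell
# Pick the verbose flag for the system's time(1):
# GNU time uses -v, BSD and macOS time use -l.
time_flag() {
  case "$(uname -s)" in
    Darwin|FreeBSD|OpenBSD|NetBSD) echo "-l" ;;
    *) echo "-v" ;;
  esac
}

# Usage: peak_rss <command> [args...]
# Runs the command under /usr/bin/time; the resource report goes to stderr.
peak_rss() {
  /usr/bin/time "$(time_flag)" "$@"
}

# Harmless demonstration; replace `true` with the real build command.
if [ -x /usr/bin/time ]; then
  peak_rss true
fi
```

Note that the units differ between implementations (GNU time prints kilobytes, macOS prints bytes), so the raw numbers are not directly comparable across systems.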

Also, this is a complete stab in the dark, but something has puzzled me on macOS - it apparently has an 8MB default stacksize limit and I have seen other *BSD insist on > 16MB stack to build stage3. Might be worth a quick try to expand stacksize in the shell to 32MB before building zig.
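The stack-size experiment suggested above could look like the following sketch. It assumes a shell whose `ulimit -s` takes KiB (true for bash, zsh and dash); the build command is left as a commented placeholder because the exact invocation depends on how stage3 is being built.

```shell
# 32 MiB expressed in KiB, the unit `ulimit -s` expects.
stack_kib=$((32 * 1024))

echo "current stack limit: $(ulimit -s) KiB"
(
  # Subshell: the changed limit applies only to commands run inside it.
  if ulimit -s "$stack_kib" 2>/dev/null; then
    echo "stack limit now: $(ulimit -s) KiB"
    # Run the usual build command here (placeholder, adjust as needed).
    # cmake --build build
  else
    echo "could not raise stack limit (hard limit: $(ulimit -H -s))" >&2
  fi
)
```

Because the soft limit cannot exceed the hard limit, the `else` branch reports the hard limit when the request is refused.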

mikdusan avatar Apr 10 '23 19:04 mikdusan

> Where should bug reports for macOS be sent?

I don't know, maybe you can leave a flaming bag of poop on the doorstep of your local Apple store. However, this website is the Zig Programming Language issue tracker. Kernel panics are entirely out of scope.

If, however, you choose to accept my suggestion of this issue to be about the segfault that you mentioned rather than about the kernel panic, then it will become in scope and you might get some developer attention on it.

andrewrk avatar Apr 10 '23 19:04 andrewrk

Some more info: This appears to be fixed on my end by cdb9cc8f6bda4b4faa270278e3b67c4ef9246a84.

Kernel coredumps point towards general hardware failure, so I suspect it was caused by undefined behavior (UB) in the C code poking bits of the CPU in a way that caused the microcode to fail.

Jan200101 avatar Apr 19 '23 08:04 Jan200101