sof icon indicating copy to clipboard operation
sof copied to clipboard

Failed "assert(buffer->is_shared)" in comp_buffer_connect()

Open marc-hb opened this issue 1 year ago • 20 comments

This is failing when fuzzing, see below.

This code came with commit 4a036999118cfa7e, @marcinszkudlinski can you please comment?

https://github.com/thesofproject/sof/actions/runs/10144692547/job/28048711573?pr=9338

==5054==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000 (pc 0x081aecfa bp 0xdbffadb8 sp 0xdbffacd0 T7)
==5054==The signal is caused by a WRITE memory access.
==5054==Hint: address points to the zero page.
    #0 0x81aecfa in k_sys_fatal_error_handler /home/runner/work/sof/sof/workspace/sof/zephyr/wrapper.c:352:19
    #1 0x815e3c9 in assert_post_action /home/runner/work/sof/sof/workspace/zephyr/lib/os/assert.c:43:2
    #2 0x8188b30 in comp_buffer_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc-helper.c:188:3
    #3 0x81819b6 in ipc_buffer_to_comp_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/helper.c:594:9
    #4 0x8181404 in ipc_comp_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/helper.c:[622](https://github.com/thesofproject/sof/actions/runs/10144692547/job/28048711573?pr=9338#step:7:623):10
    #5 0x8176f3e in ipc_glb_tplg_comp_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/handler.c:1375:9
    #6 0x8175bd4 in ipc_cmd /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/handler.c:1654:9
    #7 0x818e639 in ipc_platform_do_cmd /home/runner/work/sof/sof/workspace/sof/src/platform/posix/ipc.c:160:2
    #8 0x818750c in ipc_do_cmd /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc-common.c:326:9
    #9 0x81af56b in task_run /home/runner/work/sof/sof/workspace/sof/zephyr/include/rtos/task.h:94:9
    #10 0x81af0de in edf_work_handler /home/runner/work/sof/sof/workspace/sof/zephyr/edf_schedule.c:32:16

Originally posted by @tmleman in https://github.com/thesofproject/sof/issues/9338#issuecomment-2260058580

EDIT: somehow this was run again and is passing now: https://github.com/thesofproject/sof/actions/runs/10144692547/job/28093942074?pr=9338

Fuzzing is typically non-deterministic.

I have another PR with the same update and there was no failure there. Locally, I initially couldn't hit it either. That's why I did a rerun of fuzzing for this PR. After some longer runs, I now have the same failure and I must admit that I don't quite understand the point of this assert. The assumption seems to be that when creating components we know they will be on separate cores and that they will be connected. That's why we set the buffer as shared, but I don't know how to verify this at an earlier stage so that here we can be sure that the buffer is shared. I will also add that the same case reproduces for me on the current main.

marc-hb avatar Jul 31 '24 18:07 marc-hb

If this happens due to some invalid IPC sequence then comp_buffer_connect() should return an error code, not assert()

asserts can be disabled in production and they are disabled in production right now, example: https://github.com/thesofproject/sof/issues/9308#issuecomment-2244736317

Asserts are only a debugging aid for "impossible"/buggy conditions.

Invalid IPC sequences are not considered "impossible": IPC fuzzing would be entirely pointless if they were.

marc-hb avatar Jul 31 '24 18:07 marc-hb

Question: does IPC3 allow binding components located on separate cores? As I see in the code - no.

If its true (cross core bind is not allowed) - that means the assert is at right place and protection against the illegal situation should be in ipc_comp_to_buffer_connect

if its false (cross core bind is allowed) - assert should be replaced with an error exit code

in both situations - a fix is needed

marcinszkudlinski avatar Aug 05 '24 09:08 marcinszkudlinski

Question: does IPC3 allow binding components located on separate cores? As I see in the code - no.

If its true (cross core bind is not allowed) - that means the assert is at right place and protection against the illegal situation should be in ipc_comp_to_buffer_connect

if its false (cross core bind is allowed) - assert should be replaced with an error exit code

in both situations - a fix is needed

Its not allowed and the assert is ok as there are no IPC3 multicore users today. If someone does add multicore for IPC3 in the future they can address.

lgirdwood avatar Aug 05 '24 12:08 lgirdwood

that means the assert is at right place and protection against the illegal situation should be in ipc_comp_to_buffer_connect()

Its not allowed and the assert is ok as there are no IPC3 multicore users today.

@marcinszkudlinski could you please add an error code in ipc_comp_to_buffer_connect()? Then probably cherry-pick to stable-v2.2.

marc-hb avatar Aug 05 '24 17:08 marc-hb

well, it is a bit more complicated. As I see the check is already in place:

static int ipc_buffer_to_comp_connect(struct ipc_comp_dev *buffer,
				      struct ipc_comp_dev *comp)
{
	tr_dbg(&ipc_tr, "ipc: comp sink %d, source %d -> connect", comp->id,
	       buffer->id);

#if CONFIG_INCOHERENT
	if (comp->core != buffer->cb->core) {
		tr_err(&ipc_tr, "ipc: shared buffers are not supported for IPC3 incoherent architectures");
		return -ENOTSUP;
	}
#endif
	return comp_buffer_connect(comp->cd, comp->core, buffer->cb,
				   PPL_CONN_DIR_BUFFER_TO_COMP);
}

its checked for incoherent arch only, but the assert is there in every case

@marc-hb @lgirdwood Looks like fuzzing is performed with CONFIG_INCOHERENT = 0 What coherent architectures we support? Is fuzzing also performed with CONFIG_INCOHERENT = 1, especially for IPC4?

The quick fix would be to put the assert in question under conditional compiling, but there may be more places where checks or even "is_shared" flags are not necessary, checking in progress

marcinszkudlinski avatar Aug 06 '24 13:08 marcinszkudlinski

The mystery deepens: I tried to reproduce but I couldn't.

  • I downloaded the unit "crash" file from https://github.com/thesofproject/sof/actions/runs/10144692547/attempts/1?pr=9338 (attached here before that log expires, should have done that sooner). IPC3 fuzz crash 9343.zip

  • git fetch origin 1067e5feb4232c1e115995aaa4d1ac08f34f1dda # merge commit in https://github.com/thesofproject/sof/actions/runs/10144692547/job/28048711573?pr=9338

  • west update

  • scripts/fuzz.sh -b -p -- -DCONFIG_IPC_MAJOR_3=y [ -DEXTRA_CFLAGS="-O0 -g3 ]

  • gdb -tui ./build-fuzz/zephyr/zephyr.exe

  • break ipc_comp_connect

  • run ./crash-117ab2d04ed5afa9417f47d35e1ffd7182663826

But SOF never even reaches ipc_buffer_to_comp_connect()! It returns an -EINVAL from ipc_comp_connect() first because icd_source is NULL:

int ipc_comp_connect(struct ipc *ipc, ipc_pipe_comp_connect *_connect)
{
	struct sof_ipc_pipe_comp_connect *connect = ipc_from_pipe_connect(_connect);
	struct ipc_comp_dev *icd_source;
	struct ipc_comp_dev *icd_sink;

	/* check whether the components already exist */
	icd_source = ipc_get_comp_dev(ipc, COMP_TYPE_ANY, connect->source_id);
	if (!icd_source) {
		tr_err(&ipc_tr, "ipc_comp_connect(): source component does not exist, source_id = %u sink_id = %u",
		       connect->source_id, connect->sink_id);
		return -EINVAL;
	}


What did I miss?

I have a newer Clang version (18), could that ~make a difference~? EDIT: tried on Ubuntu 22.04 just like Github and no difference.

marc-hb avatar Aug 06 '24 16:08 marc-hb

Its not allowed and the assert is ok as there are no IPC3 multicore users today. If someone does add multicore for IPC3 in the future they can address.

Right, but the less proximate problem here is that the cross-core bind is accessible via the IPC3 protocol even if it's "not allowed" as a matter of policy and device configuration, and the case isn't being handled as a protocol error, only as a an assertion failure downward in the stack when it hits an unexpected situation.

As far as determinism: I've had mixed success with the reproducer files generated by the fuzzer too. It seems like too many things can affect the flow. Presumably what's happened is that the previously-fuzzed IPC commands left SOF in a weird state, but that may not be reachable in a reliable way from a developer system. Generally it's OK just to look at a failure and understand how we "could" have reached it and then fix via analysis instead.

If we're really sure that this "shouldn't be able to happen" then I guess we have work to do.

andyross avatar Aug 06 '24 17:08 andyross

Presumably what's happened is that the previously-fuzzed IPC commands left SOF in a weird state, but that may not be reachable in a reliable way from a developer system.

Thanks! This means the crash-... file is technically incomplete. This is disappointing - and confusing/time-consuming.

@marc-hb @lgirdwood Looks like fuzzing is performed with CONFIG_INCOHERENT = 0 What coherent architectures we support? Is fuzzing also performed with CONFIG_INCOHERENT = 1, especially for IPC4?

Thanks, I took a quick look at this and I'm afraid we have some serious Kconfig problems here (what's new?). CONFIG_INCOHERENT is guarded by if (CONFIG_)XTENSA in src/platform/Kconfig. But:

  • (CONFIG_)XTENSA is not defined when fuzzing: neither fuzzing IPC3 nor fuzzing IPC4
  • More surprisingly, (CONFIG_)XTENSA is not defined either when compiling IPC3 on the main branch?! I tried ./scripts/docker-run.sh ./scripts/xtensa-build-all.sh rn.

So, in all of these cases CONFIG_INCOHERENT is not even reachable.

It would help if Kconfig+cpp could make a difference between "undefined" and zero but I'm afraid that ship has sailed a long time ago; sorry for the digression.

Long story short I think the very first step should be "top-down": first making sense of and clarifying the CONFIG_INCOHERENT situation. Only then we can start looking at the C code.

marc-hb avatar Aug 06 '24 19:08 marc-hb

Thanks! This means the crash-... file is technically incomplete. This is disappointing - and confusing/time-consuming.

To be clear: some of these files have worked for me in the past. When they do, they save an ENORMOUS amount of testing time. So don't just cast them aside. Just disappointing to learn they don't always work.

marc-hb avatar Aug 06 '24 19:08 marc-hb

I think the best solution would be:

int comp_buffer_connect(struct comp_dev *comp, uint32_t comp_core,
			struct comp_buffer *buffer, uint32_t dir)
{
	/* check if it's a connection between cores */
	if (buffer->core != comp_core) {
#if CONFIG_INCOHERENT
		/* buffer must be shared */
		assert(buffer->is_shared);
#else
		buffer->is_shared = true;
#endif
		if (!comp->is_shared)
			comp_make_shared(comp);
	}

	return pipeline_connect(comp, buffer, dir);
}

for incoherent archs the buffer must be allocated as shared/not shared, for coherent it may be set as shared at any moment

@lgirdwood ?

marcinszkudlinski avatar Aug 08 '24 10:08 marcinszkudlinski

https://github.com/thesofproject/sof/pull/9356 please comment

marcinszkudlinski avatar Aug 09 '24 08:08 marcinszkudlinski

Even if 9356 is perfect, it does not change anything to the apparently messy Kconfig situation... @andyross could you help there? Should CONFIG_INCOHERENT really be gated by CONFIG_XTENSA? Can we align build-fuzz/.config with build/.config more? Some differences are unavoidable but they should be as small as possible: fuzzing something different from the product is both dangerous and a bit of a waste.

marc-hb avatar Aug 09 '24 17:08 marc-hb

Should CONFIG_INCOHERENT really be gated by CONFIG_XTENSA?

Likely yes? The latter is a Zephyr kconfig indicating the architecture, which always has an incoherent L1 cache. The former is a SOF tunable I'm less familiar with. From the perspective of software[1], the incoherence is irrelevant when there's only one CPU. So my guess is that SOF allows this to be =n on single CPU builds? The equivalent (but logically converse) Zephyr abstraction is something called CONFIG_KERNEL_COHERENCE, which forces mutable .data/.bss access to be uncached and does some trickery to allow stacks to be cached, while leaving everything else (.rodata and instruction literals in .text) cached.

But regardless, the only architecture in this whole problem space with a truly incoherent cache is Xtensa, and that's very unlikely to change.

[1] And software only! Things like host-shared memory and DMA buffers still care about coherence even with a single CPU, but that's generally handled at the driver layer and not global firmware behavior, AFAIK.

andyross avatar Aug 09 '24 20:08 andyross

Should CONFIG_INCOHERENT really be gated by CONFIG_XTENSA?

Likely yes? The latter is a Zephyr kconfig indicating the architecture, which always has an incoherent L1 cache

So does this mean the "INCOHERENT" code cannot be fuzzed? (This is why I asked in the first place)

marc-hb avatar Aug 10 '24 21:08 marc-hb

@marc-hb even if you enable the flag for a cohernt arch, it won't change much (well, except some asserts like the above) - but the main problem, the infamous cache incoherency, won't be tested with fuzzing.

we can think of some backdoors - fuzzing is not any of productions' compilations anyway, but better would be to run fuzzing using an emulator.

marcinszkudlinski avatar Aug 12 '24 06:08 marcinszkudlinski

but the main problem, the infamous cache incoherency, won't be tested with fuzzing.

AFAIK fuzzing is single threaded so I was never expecting fuzzing to catch memory coherency issues, that's not the point I was trying to make.

My main point is: we should always minimize #ifdef differences between fuzzed code and production code as much as possible so GitHub issues like this one don't even exist and we save a lot of time. Also, we have a greater chance to catch actual input sanitization issues. It's not about fuzzing finding very complex bugs but about minimizing #ifdef differences to have fewer false positives and fewer false negatives.

(well, except some asserts like the above)

Maybe it's not very "ambitious" but that's more like my main point.

but better would be to run fuzzing using an emulator.

That does not sound mutually exclusive. You can never have too much test coverage. You can only have limited validation time and resources and bad validation priorities.

marc-hb avatar Aug 12 '24 19:08 marc-hb

AFAIK fuzzing is single threaded so I was never expecting fuzzing to catch memory coherency issues, that's not the point I was trying to make.

You're right

My main point is: we should always minimize #ifdef differences between fuzzed code and production code as much as possible

Well, I believe a backdoor than is not a bad idea,

marcinszkudlinski avatar Aug 13 '24 08:08 marcinszkudlinski

@marc-hb @marcinszkudlinski I believe this can be closed now with https://github.com/thesofproject/sof/pull/9356 merged ?

kv2019i avatar Aug 20 '24 07:08 kv2019i

@kv2019i the assert won't go off anymore. @marc-hb maybe open another issue for fuzzing compilation params?

marcinszkudlinski avatar Aug 20 '24 10:08 marcinszkudlinski

@marc-hb maybe open another issue for fuzzing compilation params?

I just filed:

  • #9386

Fuzzing is just the messenger.

marc-hb avatar Aug 22 '24 00:08 marc-hb

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

github-actions[bot] avatar Jun 19 '25 03:06 github-actions[bot]