Stateful sigs segfault on macOS and Linux
When running the stateful signature example, we get a segfault on macOS and Linux. @Guiliano99 can you please check when you get a chance?
Hi, I just re-tested the example on my machine (Debian 12) and it worked as expected. I freshly installed liboqs before running the test.
Environment:
- liboqs: 0.14.0
- liboqs-python: 0.14.0
@Guiliano99 Here is what I'm getting on Debian 12 (VirtualBox, 2 CPUs, 8 GB RAM):
~/GitHub/liboqs-python (v0.14)$ python examples/stfl_sig.py
liboqs version: 0.14.1-dev
liboqs-python version: 0.14.0
Enabled stateful signature mechanisms:
('XMSS-SHA2_10_256', 'XMSS-SHA2_16_256', 'XMSS-SHA2_20_256',
'XMSS-SHAKE_10_256', 'XMSS-SHAKE_16_256', 'XMSS-SHAKE_20_256',
'XMSS-SHA2_10_512', 'XMSS-SHA2_16_512', 'XMSS-SHA2_20_512',
'XMSS-SHAKE_10_512', 'XMSS-SHAKE_16_512', 'XMSS-SHAKE_20_512',
'XMSS-SHA2_10_192', 'XMSS-SHA2_16_192', 'XMSS-SHA2_20_192',
'XMSS-SHAKE256_10_192', 'XMSS-SHAKE256_16_192', 'XMSS-SHAKE256_20_192',
'XMSS-SHAKE256_10_256', 'XMSS-SHAKE256_16_256', 'XMSS-SHAKE256_20_256',
'XMSSMT-SHA2_20/2_256', 'XMSSMT-SHA2_20/4_256', 'XMSSMT-SHA2_40/2_256',
'XMSSMT-SHA2_40/4_256', 'XMSSMT-SHA2_40/8_256', 'XMSSMT-SHA2_60/3_256',
'XMSSMT-SHA2_60/6_256', 'XMSSMT-SHA2_60/12_256', 'XMSSMT-SHAKE_20/2_256',
'XMSSMT-SHAKE_20/4_256', 'XMSSMT-SHAKE_40/2_256', 'XMSSMT-SHAKE_40/4_256',
'XMSSMT-SHAKE_40/8_256', 'XMSSMT-SHAKE_60/3_256', 'XMSSMT-SHAKE_60/6_256',
'XMSSMT-SHAKE_60/12_256', 'LMS_SHA256_H5_W1', 'LMS_SHA256_H5_W2',
'LMS_SHA256_H5_W4', 'LMS_SHA256_H5_W8', 'LMS_SHA256_H10_W1',
'LMS_SHA256_H10_W2', 'LMS_SHA256_H10_W4', 'LMS_SHA256_H10_W8',
'LMS_SHA256_H15_W1', 'LMS_SHA256_H15_W2', 'LMS_SHA256_H15_W4',
'LMS_SHA256_H5_W8_H5_W8', 'LMS_SHA256_H10_W4_H5_W8',
'LMS_SHA256_H20_W8_H10_W8', 'LMS_SHA256_H20_W8_H15_W8',
'LMS_SHA256_H20_W8_H20_W8')
Segmentation fault: 11 python examples/stfl_sig.py
Similarly on macOS 15.6.
Did you enable the STFL sigs in liboqs? I.e., did you compile liboqs with -DOQS_ENABLE_SIG_STFL_LMS=ON and/or -DOQS_ENABLE_SIG_STFL_XMSS=ON?
I will implement this in the C++ and Go wrappers, so I'll probably figure out what is going on, as debugging C++ code that calls C is definitely easier than debugging Python calling C. But if you have some ideas, please let me know.
Hi, I also tried it with that version, but I did not get a segmentation fault. Sorry, I’m not sure what went wrong.
I used these three stateful signature flags:
- -DOQS_ENABLE_SIG_STFL_LMS=ON
- -DOQS_ENABLE_SIG_STFL_XMSS=ON
- -DOQS_HAZARDOUS_EXPERIMENTAL_ENABLE_SIG_STFL_KEY_SIG_GEN=ON
@Guiliano99 Thanks, my bad. I didn't have this enabled: -DOQS_HAZARDOUS_EXPERIMENTAL_ENABLE_SIG_STFL_KEY_SIG_GEN=ON. Now it works.
Doesn't it make sense though to have the above enabled whenever we enable LMS or XMSS?
And if not, we should gracefully exit, not with a segfault.
In addition, the unit tests for the XMSS family take a very long time (and crash the GitHub CI). I wonder whether we should keep them or disable them. For example, on my MacBook Pro M2 (32 GB RAM), running the stateful signature example with XMSS-SHA2_16_256 takes 34 seconds.
@dstebila @baentsch any comments on those two points above?
Hi, good to hear that it’s working now.
I also noticed that the pipeline takes too long because key generation for algorithms like XMSS-SHA2_20_256 takes a lot of time. Would it be possible to pre-generate some of these slow-to-generate keys, save them in PKCS#8 (OneAsymmetricKey, RFC 5958) format, and then load them with pyasn1? That way, at least the signing and verification logic can still be tested without the slow key-generation step.
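To illustrate, here is a minimal sketch of the loading side, assuming the raw liboqs secret-key bytes are stored directly in the privateKey OCTET STRING (the file name is hypothetical):

```python
# Sketch: load a pre-generated secret key stored as a DER-encoded
# OneAsymmetricKey (RFC 5958). The file name is hypothetical, and the
# raw liboqs secret-key bytes are assumed to sit directly in the
# privateKey OCTET STRING.
from pyasn1.codec.der import decoder
from pyasn1_modules import rfc5958

with open("XMSS-SHA2_20_256.der", "rb") as f:
    der_bytes = f.read()

key, _rest = decoder.decode(der_bytes, asn1Spec=rfc5958.OneAsymmetricKey())
secret_key = key["privateKey"].asOctets()  # raw bytes to hand back to liboqs
```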
@vsoftco @dstebila @baentsch
@Guiliano99 Thanks, my bad. I didn't have this enabled: -DOQS_HAZARDOUS_EXPERIMENTAL_ENABLE_SIG_STFL_KEY_SIG_GEN=ON. Now it works. Doesn't it make sense though to have the above enabled whenever we enable LMS or XMSS? And if not, we should gracefully exit, not with a segfault. In addition, the unit tests for the XMSS family take a very long time (and crash the GitHub CI). I wonder whether we should keep them or disable them. For example, on my MacBook Pro M2 (32 GB RAM), running the stateful signature example with XMSS-SHA2_16_256 takes 34 seconds.
I think it is okay if you want to pre-generate some key pairs to streamline CI tests here in liboqs-python.
Just stating the obvious: it'd be sensible to pre-generate keys only in the CI cache; this avoids any check-in of key material that may inadvertently wind up in 'active' code somewhere.
That's a good suggestion (and wasn't obvious to me).
Any check-in of key material to GH is risky -- for whatever reason, incl. CI. Experience shows it will find its way into "productive systems", and code breakage and bad publicity are waiting down the line. Good repos have checks to ensure this does not happen (cf. "secret scanners", e.g., https://docs.github.com/en/code-security/secret-scanning/introduction/about-secret-scanning).
Hi everyone,
I just wanted to clarify the ideas around caching the keys. I’m still new to this, but since I introduced this issue, I’d like to help fix it. Do you have any suggestions on how we could address it?
One idea I had was to create a third pipeline dedicated to key generation — something that could be triggered manually or on a merge request. Would that make sense?
Regarding the missing flag for STFL key generation: I’m not sure what the best approach is.
Handling it through the signal library, multiprocessing, or threading seems possible, but it would make the code more complex and less clean.
Do you have any better ideas for how to handle this?
I also noticed that the test cases take significantly longer when using liboqs version 0.15.0-rc1 — the full run took about 1h 19m 36s. I assume this slowdown is related to the SLH-DSA integration. Do you have any thoughts or suggestions on how to address the longer runtime? Should I also pre-generate some keys until the runtime is acceptable? And if so, what would be considered an acceptable duration for the tests?
Thanks in advance for your help. @vsoftco @dstebila @baentsch
I’d like to help fix it.
Thanks! That'd be great!
I assume this slowdown is related to the SLH-DSA integration.
Is there any way you can confirm this is indeed the culprit, or did you already do it? E.g., build a version with SLH-DSA disabled and re-test?
Do you have any suggestions on how we could address it?
Yes: Utilize CI caching.
One idea I had was to create a third pipeline dedicated to key generation — something that could be triggered manually or on a merge request. Would that make sense?
This sounds very complicated. Why not simply check for presence of keys in (files in) a (to-be introduced) CI cache and generate them if not found? Or would this take longer than the time permitted per CI run? In that case, indeed, some form of splitting this up/separate CI flow would be needed.
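Roughly along these lines (a sketch only; the cache directory, file naming, and the keygen helper are hypothetical, and the directory would be restored/saved by the CI cache action between runs):

```python
# Sketch of "check the cache, generate on miss". The cache directory is
# assumed to be restored/saved by the CI cache action between runs.
from pathlib import Path

KEY_CACHE = Path("tests/.stfl_keys")  # hypothetical cache location

def slow_keygen(alg: str) -> bytes:
    # Placeholder for the expensive liboqs key generation; the real
    # test suite would call into liboqs-python here.
    raise NotImplementedError(alg)

def ensure_key(alg: str) -> bytes:
    """Return cached key material for alg, generating it only on a cache miss."""
    # Mechanism names such as XMSSMT-SHA2_20/2_256 contain '/', so
    # sanitize them before using them as file names.
    path = KEY_CACHE / (alg.replace("/", "_") + ".bin")
    if path.exists():
        return path.read_bytes()
    KEY_CACHE.mkdir(parents=True, exist_ok=True)
    secret = slow_keygen(alg)
    path.write_bytes(secret)
    return secret
```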
Regarding the missing flag for STFL key generation: I’m not sure what the best approach is. Handling it through the signal library, multiprocessing, or threading seems possible, but it would make the code more complex and less clean. Do you have any better ideas for how to handle this?
I cannot follow this chain of thought. I'd have thought the idea is to build in CI one variant with these flags set and only for that variant do keygen/sign testing (using the caching mechanism as discussed above), no? Am I missing something?
And one more thought re-reading all of the above, @vsoftco: You state
Doesn't it make sense though to have the above enabled whenever we enable LMS or XMSS?
No, it does not: See https://github.com/open-quantum-safe/liboqs/pull/1650#issuecomment-1893957110.
And if not, we should gracefully exit, not with a segfault.
That is completely agreed.
Is there any way you can confirm this is indeed the culprit, or did you already do it? E.g., build a version with SLH-DSA disabled and re-test?
No, I haven’t tried that yet. I just ran it with the latest version and noticed that the runtime increased significantly. I’ll try rebuilding with SLH-DSA disabled and check if that’s the cause.
Yes: Utilize CI caching.
I wasn’t aware of this feature before, so thanks for the hint. I’ll have a look at it.
This sounds very complicated. Why not simply check for presence of keys in (files in) a (to-be introduced) CI cache and generate them if not found? Or would this take longer than the time permitted per CI run? In that case, indeed, some form of splitting this up/separate CI flow would be needed.
Yes, the pipeline would take longer than the 6-hour limit, which is why I’m unsure how to handle it. Maybe it can be split across multiple runs. I’ll check that.
I cannot follow this chain of thought. I'd have thought the idea is to build in CI one variant with these flags set and only for that variant do keygen/sign testing (using the caching mechanism as discussed above), no? Am I missing something? And if not, we should gracefully exit, not with a segfault.
One issue is that the STFL flag OQS_HAZARDOUS_EXPERIMENTAL_ENABLE_SIG_STFL_KEY_SIG_GEN is not automatically enabled when XMSS or LMS algorithms are active. Since these are stateful algorithms, even OpenSSL 3.6 supports only verification, which I now understand is intentional as well. So this behavior will not change.
However, @vsoftco suggested improving the handling so it’s easier to identify when key generation for XMSS, XMSSMT, or LMS fails, which could happen if liboqs wasn’t built with the flag mentioned above.
I saw that this could be handled by adding some wrapper code around the generate_keypair function, but it would make the code a bit "ugly". That’s why I wanted to get your opinion first. I also added the faulthandler module for debugging to help pinpoint where the crash occurs. I’m wondering whether it would be better to just mention this limitation in the README, or to add explicit error handling in the code to print a more informative message.
to add explicit error handling in the code to print a more informative message
That would be my preference. Documentation is good, too; crashes should not happen regardless of people reading documentation (or not :).
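For example, something along these lines could guard the tests (a sketch only: the StatefulSignature class name and context-manager usage are assumed to mirror oqs.Signature, and the probed mechanism is arbitrary). Running the key generation in a child process turns a crash inside the C library into a non-zero exit code instead of taking down the test runner:

```python
# Sketch: probe stateful key generation in a child process so that a
# crash inside the C library surfaces as an abnormal exit code instead
# of killing the test runner. The StatefulSignature API is assumed to
# mirror oqs.Signature; the probed mechanism is arbitrary.
import subprocess
import sys

PROBE = (
    "import oqs\n"
    "with oqs.StatefulSignature('LMS_SHA256_H5_W1') as sig:\n"
    "    sig.generate_keypair()\n"
)

def stfl_keygen_available() -> bool:
    result = subprocess.run([sys.executable, "-c", PROBE])
    return result.returncode == 0

if not stfl_keygen_available():
    sys.exit(
        "Stateful key generation is unavailable; was liboqs built with "
        "-DOQS_HAZARDOUS_EXPERIMENTAL_ENABLE_SIG_STFL_KEY_SIG_GEN=ON?"
    )
```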