ot-sca icon indicating copy to clipboard operation
ot-sca copied to clipboard

Debug aes-fvsr-key-batch timing issues

Open vogelpi opened this issue 3 years ago • 10 comments

PR #82 added the corresponding target binaries to support the aes-fvsr-key-batch command. For most of us this seems to work just fine but I saw lots of errors like:

Fixed key: b'811e3731b0120a7842781e22b25cddf9'
Connected to ChipWhisperer (num_samples: 1200, num_samples_actual: 1200, num_segments_actual: 1)
Capturing:  16%|███▋                   | 162/1000 [00:00<00:01, 492.97 traces/s]
Traceback (most recent call last):
  File "./capture.py", line 771, in <module>
    app()
 ...
  File "./capture.py", line 502, in aes_fvsr_key_batch
    capture_aes_fvsr_key_batch(
  File "./capture.py", line 460, in capture_aes_fvsr_key_batch
    check_ciphertext(ot, expected_last_ciphertext, 4)
  File "./capture.py", line 221, in check_ciphertext
    assert actual_last_ciphertext == expected_last_ciphertext[0:ciphertext_len], (
AssertionError: Incorrect encryption result!
actual: bytearray(b'^\xdc\xe8{')
expected: [ 85 154 226   3  30  12  31 147  54 197 151  35   4 134  80  98]

In words, there are sometimes mismatches between the expected and received ciphertexts. Sometimes, the failure occurs for the first batch, sometimes for a later batch, sometimes it doesn't occur at all. The failures seem to depend on timing (adding some sleep command on the Python side seem to help), Husky firmware (the latest firmware seems to be more affected), and maybe also USB connection setup (docking station, hub, laptop directly).

We should root cause and fix the problem. Otherwise we can't reliably do long running captures. Imagine we collect 10 Mio traces and get such a failure after 2 hours - all traces will be lost.

vogelpi avatar Jun 20 '22 16:06 vogelpi

I have investigated the issue and I guess I found the root cause.

We have two consecutive serial write command in 'capture_aes_fvsr_key_batch()' function in 'capture.py' script:

ot.target.simpleserial_write("s", capture_cfg["batch_prng_seed"].to_bytes(4, "little"))
ot.target.simpleserial_write("t", key_fixed)

In the device firmware 'aes_serial.c', we are polling incoming packets and call the relevant function:

while (true) {
    simple_serial_process_packet();
}

After starting the first operation, in this example seeding the prng, the second data packet may be sent before the first operation is completed and because of that it may be missed by the firmware. (Actually I would expect there would be some kind of buffer to hold serial data packets but it seems there aren't any.)

We can simply avoid this issue by placing delays between 'simpleserial_write()' functions. We are doing that unintentionly by waiting the result of the operation after starting it by sending an encrypt request.

However, the communication between the computer and device is not entirely deterministic and because of that putting a delay may not be a reliable solution. Maybe it would be better if we wait for a success message from the device after seeding PRNG, setting key etc. but not after starting an encryption because we are waiting the result anyway. It may decrease performance if we also wait for a success message after initializing an encryption request.

What do you think?

abdullahvarici avatar Aug 17 '22 14:08 abdullahvarici

simpleserial has an "ack" command that helps with this, and can be used to delay the protocol while a long-running process happens.

Checking the code I think the status is only sent on failure. By default the low-level simpleserial_write don't wait for the ack to return, only simpleserial_read will do that.

In the original CW you can see the status return here.

If you try inserting a status print after this line it may work there:

simple_serial_send_status(0);

Then to force this to wait:

ot.target.simpleserial_write("s", capture_cfg["batch_prng_seed"].to_bytes(4, "little"))
ot.target.simpleserial_wait_ack()
ot.target.simpleserial_write("t", key_fixed)

I would do similar things on any other write (you can see this is what we do for example on this key write function).

I haven't been building for OT for a bit so sorry I can't test quickly here, but will try to validate. However to avoid someone chasing this down when I think the issue is relatively simple.

colinoflynn avatar Aug 19 '22 19:08 colinoflynn

Thanks @colinoflynn! I will try this and create a PR.

abdullahvarici avatar Aug 22 '22 10:08 abdullahvarici

Although I had no problems with the kmac-fvsr-key-batch, I ran into similar problems while developing the sha3-fvsr-data-batch.

However, the behavior was a little bit different and more un(?)predictable. It always failed on my machine, but worked on @wettermo's machine. Further we needed the ot.target.simpleserial_wait_ack() after the ot.target.simpleserial_write( "b", scope.num_segments_actual.to_bytes(4, "little") )

Does it make sense to add an ot.target.simpleserial_wait_ack() after each and every command that does not provide an immediate response? I can have a look on that after finishing the sha3-fvsr tests.

m-temp avatar Nov 02 '22 12:11 m-temp

In my opinion, adding ot.target.simpleserial_wait_ack() would make it more stable but also the capture rate will drop.

abdullahvarici avatar Nov 02 '22 13:11 abdullahvarici

Can we estimate the impact on the capture rate? Personally, I'd favor to acknowledge as many commands as possible, as long as the impact on the capture rate is acceptable. See also the discussion in https://github.com/lowRISC/opentitan/pull/15929#discussion_r1015295374

m-temp avatar Nov 07 '22 13:11 m-temp

I am not sure if we can estimate it. Maybe we can try and see :) I agree with you. We may not have a problem when we didn't add ack after every command but it does not mean we won't, I guess.

abdullahvarici avatar Nov 07 '22 13:11 abdullahvarici

I'm working on this issue. First experiments show that there is no observable decrease of the capture rate after acknowledging all write commands. Once this is complete, I will update the scripts and the binaries.

EDIT: Source for generating aes binaries: here

vrozic avatar Oct 19 '23 14:10 vrozic

Interestingly, I haven't yet observed any such issues while batch capturing on husky and waverunner using the scripts here: https://github.com/lowRISC/ot-sca/pull/194 But I do check full ciphertext after every batch and all texts are AES encryptions of previous one going back to first one that is sent to OT.

johannheyszl avatar Oct 19 '23 16:10 johannheyszl

Since we will replace the communication layer for post-silicon to unify with the other post-silicon SiVal testing, we may not want to spend much effort in fixing this? :)

Generally I am fully in favor of more ack's and more robust communication even if sacrificing some speed. We need correct trace-data / text pairs

johannheyszl avatar Oct 23 '23 12:10 johannheyszl