opentitan icon indicating copy to clipboard operation
opentitan copied to clipboard

[manuf] fix flaky personalize_functest and reactivate in CI

Open pamaury opened this issue 2 years ago • 9 comments

Description

Some of the manufacturing tests are failing randomly in the CI (relatively rare occurence). The most common failures (in//sw/device/silicon_creator/manuf/lib) are:

  • provisioning_functest_fpga_cw310_rom_with_fake_keys
  • manuf_cp_device_info_flash_wr_test_unlocked0_to_prod_end_functest_fpga_cw310_rom_with_fake_keys
  • /manuf_cp_device_info_flash_wr_test_unlocked0_to_prod_functest_fpga_cw310_rom_with_fake_keys

I have created a stress PR #19076 to try to reproduce them.

pamaury avatar Jun 29 '23 11:06 pamaury

Any luck on tracking this down @pamaury ?

timothytrippel avatar Aug 23 '23 22:08 timothytrippel

I have been working on manuf_cp_device_info_flash_wr and I think I found the root cause. This test starts with a bitstream with ROM execution disabled so there is no issue connecting to the JTAG initially. But then, the RAM program calls manuf_individualize_device_sw_cfg that will write the CREATOR_SW_CFG partition and enable ROM execution (with the current settings in earlgrey_a0_skus). We then call trigger_lc_transition, where we set the strap pins and reset the target and connect over JTAG again so that the manufacturing test can later on check the transition was correctly done. This is a problem though because now ROM execution is enabled and the flash is empty so the device keeps rebooting which break the JTAG connection.

We have several uses of trigger_lc_transition and potentially all of them are faulty. It seems to me that we should change trigger_lc_transition to just disconnect the JTAG and reset the target and leave it to the caller to decide what to do since it is not always safe to reconnect the JTAG. We should review all uses.

For manuf_cp_device_info_flash_wr, a simple solution would be to replace the JTAG-based LC test by simpling looking at the UART output of the ROM and check the LCV:xxxxx line.

pamaury avatar Aug 24 '23 15:08 pamaury

I started to review all the uses of trigger_lc_transition and here are my conclusions (but feel free to double-check):

  • the bitstream_otp_ctrl_functest bitstream has execution disabled although this is not obvious and should probably be documented, same for `
  • manuf_cp_test_lock_functest is safe: uses bitstream_otp_ctrl_functest and does not change the creator partition
  • manuf_cp_unlock_raw is safe: ROM execution is not enabled
  • manuf_scrap: looks safe since CPU is not enabled in SCRAP. BUT, isn't the LC TAP supposed to be disabled in SCRAP per the documentation?
  • manuf_cp_yield: safe, uses a bitstream with ROM execution disabled
  • personalize_functest: potentially unsafe: first bootstraps a program to personalize the device and then triggers an LC transition. But the transition is done after the program is done executing and I don't see anything in the code that prevents the device from rebooting though I am not sure.
  • cp_provision_init: safe, starts in RAW state and bootstraps a spinning binary after transition

I think in the future, we might want to harden those manufacturing tests with some make_sure_device_is_not_rebooting() function that checks in key points that the device is not constantly rebooting. For example this could be done by looking at the UART console for one second and making sure that the ROM is not repeatedly reboot.

What do you think @timothytrippel ?

pamaury avatar Aug 24 '23 23:08 pamaury

I started to review all the uses of trigger_lc_transition and here are my conclusions (but feel free to double-check):

  • the bitstream_otp_ctrl_functest bitstream has execution disabled although this is not obvious and should probably be documented, same for

same for ?

  • manuf_cp_test_lock_functest is safe: uses bitstream_otp_ctrl_functest and does not change the creator partition
  • manuf_cp_unlock_raw is safe: ROM execution is not enabled
  • manuf_scrap: looks safe since CPU is not enabled in SCRAP. BUT, isn't the LC TAP supposed to be disabled in SCRAP per the documentation?

Hmm, good point. Its seems the LC TAP does work even in SCRAP. @msfschaffner is this expected?

  • manuf_cp_yield: safe, uses a bitstream with ROM execution disabled
  • personalize_functest: potentially unsafe: first bootstraps a program to personalize the device and then triggers an LC transition. But the transition is done after the program is done executing and I don't see anything in the code that prevents the device from rebooting though I am not sure.

This one starts in a mission mode LC state (i.e., DEV/PROD), loads a program into flash to test the personalization lib, which includes provisioning the RMA token and exporting it in encrypted form back to the host. The host test harness then checks the validity of the RMA token by trying to use it to perform an RMA LC transition, which triggers a reset, since there is a program in in flash, everything should be ok (though it looks like it is not from your stress tests results?), but we could enhance the device binary to spin-wait if it detects the LC state to be RMA.

  • cp_provision_init: safe, starts in RAW state and bootstraps a spinning binary after transition

I think in the future, we might want to harden those manufacturing tests with some make_sure_device_is_not_rebooting() function that checks in key points that the device is not constantly rebooting. For example this could be done by looking at the UART console for one second and making sure that the ROM is not repeatedly reboot.

Yeah thats not a bad idea.

What do you think @timothytrippel ?

Thanks for looking into this @pamaury , glad its just a SW issue, not a HW issue :)

timothytrippel avatar Aug 25 '23 19:08 timothytrippel

I have been working on manuf_cp_device_info_flash_wr and I think I found the root cause. This test starts with a bitstream with ROM execution disabled so there is no issue connecting to the JTAG initially. But then, the RAM program calls manuf_individualize_device_sw_cfg that will write the CREATOR_SW_CFG partition and enable ROM execution (with the current settings in earlgrey_a0_skus). We then call trigger_lc_transition, where we set the strap pins and reset the target and connect over JTAG again so that the manufacturing test can later on check the transition was correctly done. This is a problem though because now ROM execution is enabled and the flash is empty so the device keeps rebooting which break the JTAG connection.

What if, instead of checking the LC state post transition on the host side, we just check it on the device side. I.e., IIRC, that test loads an SRAM program when it is in the TEST_UNLOCKED* state to provision the wafer auth secret in flash info page 3, then transitions to a mission mode state (e.g., PROD) where it attempts to read it back to verify it is correct. If we just did the transition over the JTAG LC TAP, reset, then do not try to reconnect JTAG, but rather just let flash boot, we could check the LC state on the device side no?

We have several uses of trigger_lc_transition and potentially all of them are faulty. It seems to me that we should change trigger_lc_transition to just disconnect the JTAG and reset the target and leave it to the caller to decide what to do since it is not always safe to reconnect the JTAG. We should review all uses.

For manuf_cp_device_info_flash_wr, a simple solution would be to replace the JTAG-based LC test by simpling looking at the UART output of the ROM and check the LCV:xxxxx line.

timothytrippel avatar Aug 25 '23 19:08 timothytrippel

Thanks for the suggestions, I am trying to implement them. While working on personalize_functest, I ran into something I do not understand: after the device side has exported the manuf_personalize_t data, I added a loop so that the host side can connect over JTAG to check the LC state (the current code is unsafe because the device just reboots in a loop). This kind of works but displays the following message:

[2023-08-28T11:44:59Z INFO  opentitanlib::util::openocd::stderr] Info : JTAG tap: lc_ctrl.tap tap/device found: 0x10002cdf (mfg: 0x66f (lowRISC), part: 0x0002, ver: 0x1)
[2023-08-28T11:44:59Z INFO  opentitanlib::util::openocd::stderr] Error: Debugger is not authenticated to target Debug Module. (dmstatus=0x40010002). Use `riscv authdata_read` and `riscv authdata_write` commands to authenticate.

The Debugger is not authenticated message keeps repeating until the end of the test. I have checked and as far as I can tell, the device is not rebooting. Also the value of dmstatus is extremely suspicious since the top bits of dmstatus are supposed to be always 0. I am not entirely sure what is happening. Strangely, the JTAG still responds correctly to commands.

EDIT: after more investigation, the Debugger is not authenticated message seems to start happening after locking the secret2 partition (otp_ctrl_testutils_lock_partition(otp_ctrl, kDifOtpCtrlPartitionSecret2, 0)) and rebooting (it occurs even if we don't do the RMA transition).

pamaury avatar Aug 28 '23 11:08 pamaury

I pushed an updated fix in #19526. Even though I believe this fixes the problem with JTAG itself, I have still observed one failure of personalize_functest where the LC did not report an error but did not perform a transition either (ie stayed in DEV). I have not been able to reproduce it so far.

pamaury avatar Aug 28 '23 15:08 pamaury

I think we can close this now. The last outstanding issue is tracked in #20580. Feel free to reopen again if needed.

timothytrippel avatar Jan 12 '24 22:01 timothytrippel

The test is still flaky, 75% pass rate in the last 14 days according to the dashboard.

engdoreis avatar Jan 25 '24 10:01 engdoreis