GC process reliability

Open rojer opened this issue 8 years ago • 15 comments

i set up a simple test: take a filesystem image with no deleted files, delete a bunch of them, then call SPIFFS_gc and simulate power loss at each write. at some point, when the filesystem is later remounted, one of the files that were not touched (or should not have been, rather) becomes unreadable. this is a very critical problem, since devices tend to be reset at random times and GC occurrence is unpredictable.
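for context, here's a minimal sketch of how such a power-loss simulation can be wired in via the spiffs HAL write callback. all the names here (flash_mem, fail_after_write, mem_write, ...) are purely illustrative; the actual sptest tool attached further down this thread differs in detail.

/* sketch: simulate power loss at write #N by counting calls to the HAL
 * write callback of an in-memory flash image. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "spiffs.h"

#define FLASH_SIZE (128 * 1024)

static u8_t flash_mem[FLASH_SIZE];  /* in-memory copy of the flash image */
static int write_count = 0;
static int fail_after_write = -1;   /* -1 means never fail */

static s32_t mem_read(u32_t addr, u32_t size, u8_t *dst) {
  memcpy(dst, &flash_mem[addr], size);
  return SPIFFS_OK;
}

static s32_t mem_write(u32_t addr, u32_t size, u8_t *src) {
  if (fail_after_write >= 0 && ++write_count >= fail_after_write) {
    /* "power loss": drop this write, dump the image for inspection, bail */
    FILE *f = fopen("boom.bin", "wb");
    if (f != NULL) {
      fwrite(flash_mem, 1, sizeof(flash_mem), f);
      fclose(f);
    }
    fprintf(stderr, "=== BOOM at write #%d\n", write_count);
    exit(1);
  }
  for (u32_t i = 0; i < size; i++) {
    flash_mem[addr + i] &= src[i];  /* NOR flash: bits only go 1 -> 0 */
  }
  return SPIFFS_OK;
}

static s32_t mem_erase(u32_t addr, u32_t size) {
  memset(&flash_mem[addr], 0xff, size);
  return SPIFFS_OK;
}

the test then loads an image into flash_mem, mounts it with these callbacks in spiffs_config, deletes a few files, calls SPIFFS_gc() and, after the simulated crash, remounts and tries to read back every file.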

$ ./spls 0xdb000.bin 
image size: 131072
   0 iddddidddddiddd    era_cnt: 0
   1 diddddddddddddd    era_cnt: 0
   2 ddddddddddddddd    era_cnt: 0
   3 dddddidddddidid    era_cnt: 0
   4 ddddddddddddddd    era_cnt: 0
   5 dddddiddidddddd    era_cnt: 0
   6 dddddidddddddid    era_cnt: 0
   7 ddddddddddd____    era_cnt: 0
   8 _______________    era_cnt: 0
   9 _______________    era_cnt: 0
  10 _______________    era_cnt: 0
  11 _______________    era_cnt: 0
  12 _______________    era_cnt: 0
  13 _______________    era_cnt: 0
  14 _______________    era_cnt: 0
  15 _______________    era_cnt: 0
  16 _______________    era_cnt: 0
  17 _______________    era_cnt: 0
  18 _______________    era_cnt: 0
  19 _______________    era_cnt: 0
  20 _______________    era_cnt: 0
  21 _______________    era_cnt: 0
  22 _______________    era_cnt: 0
  23 _______________    era_cnt: 0
  24 _______________    era_cnt: 0
  25 _______________    era_cnt: 0
  26 _______________    era_cnt: 0
  27 _______________    era_cnt: 0
  28 _______________    era_cnt: 0
  29 _______________    era_cnt: 0
  30 _______________    era_cnt: 0
  31 _______________    era_cnt: 0
era_cnt_max: 1
last_errno:  0
blocks:      32
free_blocks: 24
page_alloc:  116
page_delet:  0
used:        29116 of 113201

-- files:
 server.key.pem: 916 0xfcbbff51
 conf.json: 1115 0xd0c2b474
 conf1.json: 803 0x1b18434d
 sys_config_schema.json: 8166 0x61be1ff1
 aws-iot-faa6f05c81.crt.pem: 1141 0x9e858b4d
 aws-iot-faa6f05c81.key.pem: 227 0xad3bfcc8
 ca-verisign-g5.crt.pem: 5142 0x2b19102d
 sys_ro_vars_schema.json: 332 0xdf6558a7
 server.crt.pem: 2752 0x29941f1d
 ca-verisign-ecc-g2.crt.pem: 1590 0x8635500e
 conf0.json: 2939 0x0128cb8c
-- overall crc32: 0x3f94ac09
$ ./sptest 0xdb000.bin 38                                                                                                                                                            
=== Will fail at write #38
image size: 131072
remove conf.json
 W  #0000 1 @ 1540
 W  #0001 2 @ 20
 W  #0002 1 @ 2820
 W  #0003 2 @ 18
 W  #0004 1 @ 2564
 W  #0005 2 @ 16
 W  #0006 1 @ 2308
 W  #0007 2 @ 14
 W  #0008 1 @ 2052
 W  #0009 2 @ 12
 W  #0010 1 @ 1796
 W  #0011 2 @ 10
 W  #0012 1 @ 1540
remove aws-iot-faa6f05c81.crt.pem
 W  #0013 1 @ 13828
 W  #0014 2 @ 12308
 W  #0015 1 @ 15108
 W  #0016 2 @ 12306
 W  #0017 1 @ 14852
 W  #0018 2 @ 12304
 W  #0019 1 @ 14596
 W  #0020 2 @ 12302
 W  #0021 1 @ 14340
 W  #0022 2 @ 12300
 W  #0023 1 @ 14084
 W  #0024 2 @ 12298
 W  #0025 1 @ 13828
remove aws-iot-faa6f05c81.key.pem
 W  #0026 1 @ 15364
 W  #0027 2 @ 12312
 W  #0028 1 @ 15620
 W  #0029 2 @ 12310
 W  #0030 1 @ 15364
== dumped fs to before_gc.bin
R   #0131 256 @ 0
R   #0132 2 @ 254
R   #0133 256 @ 4096
R   #0134 256 @ 8192
R   #0135 256 @ 12288
R   #0136 2 @ 12542
R   #0137 256 @ 16384
R   #0138 256 @ 20480
R   #0139 256 @ 24576
R   #0140 256 @ 28672
R   #0141 256 @ 32768
R   #0142 256 @ 36864
R   #0143 256 @ 40960
R   #0144 256 @ 45056
R   #0145 256 @ 49152
R   #0146 256 @ 53248
R   #0147 256 @ 57344
R   #0148 256 @ 61440
R   #0149 256 @ 65536
R   #0150 256 @ 69632
R   #0151 256 @ 73728
R   #0152 256 @ 77824
R   #0153 256 @ 81920
R   #0154 256 @ 86016
R   #0155 256 @ 90112
R   #0156 256 @ 94208
R   #0157 256 @ 98304
R   #0158 256 @ 102400
R   #0159 256 @ 106496
R   #0160 256 @ 110592
R   #0161 256 @ 114688
R   #0162 256 @ 118784
R   #0163 256 @ 122880
R   #0164 256 @ 126976
R   #0165 256 @ 12288
R   #0166 5 @ 12544
R   #0167 256 @ 12288
R   #0168 256 @ 16384
R   #0169 256 @ 20480
R   #0170 256 @ 24576
R   #0171 256 @ 28672
R   #0172 256 @ 32768
R   #0173 256 @ 36864
R   #0174 256 @ 40960
R   #0175 256 @ 45056
R   #0176 256 @ 49152
R   #0177 256 @ 53248
R   #0178 256 @ 57344
R   #0179 256 @ 61440
R   #0180 256 @ 65536
R   #0181 256 @ 69632
R   #0182 256 @ 73728
R   #0183 256 @ 77824
R   #0184 256 @ 81920
R   #0185 256 @ 86016
R   #0186 256 @ 90112
R   #0187 256 @ 94208
R   #0188 256 @ 98304
R   #0189 256 @ 102400
R   #0190 256 @ 106496
R   #0191 256 @ 110592
R   #0192 256 @ 114688
R   #0193 256 @ 118784
R   #0194 256 @ 122880
R   #0195 256 @ 126976
R   #0196 256 @ 0
R   #0197 256 @ 4096
R   #0198 5 @ 4608
R   #0199 256 @ 4608
R   #0200 256 @ 12288
R   #0201 5 @ 12544
R   #0202 256 @ 0
R   #0203 256 @ 4096
R   #0204 256 @ 8192
R   #0205 256 @ 12288
R   #0206 256 @ 16384
R   #0207 256 @ 20480
R   #0208 256 @ 24576
R   #0209 256 @ 28672
R   #0210 64 @ 12544
 W  #0031 64 @ 31744
R   #0211 64 @ 12608
 W  #0032 64 @ 31808
R   #0212 64 @ 12672
 W  #0033 64 @ 31872
R   #0213 64 @ 12736
 W  #0034 64 @ 31936
 W  #0035 2 @ 28694
 W  #0036 2 @ 12288
 W  #0037 1 @ 12548
R   #0214 256 @ 12288
R   #0215 5 @ 12800
R   #0216 256 @ 28672
R   #0217 64 @ 12800
 W  #0038 64 @ 32000
=== BOOM!
== dumped fs to boom.bin
[rojer@nba ~/go/src/cesanta.com/common/spiffs/tools mos1]$ ./spls before_gc.bin 
image size: 131072
   0 idddd//////iddd    era_cnt: 0
   1 diddddddddddddd    era_cnt: 0
   2 ddddddddddddddd    era_cnt: 0
   3 ddddd////////id    era_cnt: 0
   4 ddddddddddddddd    era_cnt: 0
   5 dddddiddidddddd    era_cnt: 0
   6 dddddidddddddid    era_cnt: 0
   7 ddddddddddd____    era_cnt: 0
   8 _______________    era_cnt: 0
   9 _______________    era_cnt: 0
  10 _______________    era_cnt: 0
  11 _______________    era_cnt: 0
  12 _______________    era_cnt: 0
  13 _______________    era_cnt: 0
  14 _______________    era_cnt: 0
  15 _______________    era_cnt: 0
  16 _______________    era_cnt: 0
  17 _______________    era_cnt: 0
  18 _______________    era_cnt: 0
  19 _______________    era_cnt: 0
  20 _______________    era_cnt: 0
  21 _______________    era_cnt: 0
  22 _______________    era_cnt: 0
  23 _______________    era_cnt: 0
  24 _______________    era_cnt: 0
  25 _______________    era_cnt: 0
  26 _______________    era_cnt: 0
  27 _______________    era_cnt: 0
  28 _______________    era_cnt: 0
  29 _______________    era_cnt: 0
  30 _______________    era_cnt: 0
  31 _______________    era_cnt: 0
era_cnt_max: 1
last_errno:  0
blocks:      32
free_blocks: 24
page_alloc:  102
page_delet:  14
used:        25602 of 113201

-- files:
 server.key.pem: 916 0xfcbbff51
 conf1.json: 803 0x1b18434d
 sys_config_schema.json: 8166 0x61be1ff1
 ca-verisign-g5.crt.pem: 5142 0x2b19102d
 sys_ro_vars_schema.json: 332 0xdf6558a7
 server.crt.pem: 2752 0x29941f1d
 ca-verisign-ecc-g2.crt.pem: 1590 0x8635500e
 conf0.json: 2939 0x0128cb8c
-- overall crc32: 0xdce86ff8
[rojer@nba ~/go/src/cesanta.com/common/spiffs/tools mos1]$ ./spls boom.bin 
image size: 131072
   0 idddd//////iddd    era_cnt: 0
   1 diddddddddddddd    era_cnt: 0
   2 ddddddddddddddd    era_cnt: 0
   3 /dddd////////id    era_cnt: 0
   4 ddddddddddddddd    era_cnt: 0
   5 dddddiddidddddd    era_cnt: 0
   6 dddddidddddddid    era_cnt: 0
   7 dddddddddddd___    era_cnt: 0
   8 _______________    era_cnt: 0
   9 _______________    era_cnt: 0
  10 _______________    era_cnt: 0
  11 _______________    era_cnt: 0
  12 _______________    era_cnt: 0
  13 _______________    era_cnt: 0
  14 _______________    era_cnt: 0
  15 _______________    era_cnt: 0
  16 _______________    era_cnt: 0
  17 _______________    era_cnt: 0
  18 _______________    era_cnt: 0
  19 _______________    era_cnt: 0
  20 _______________    era_cnt: 0
  21 _______________    era_cnt: 0
  22 _______________    era_cnt: 0
  23 _______________    era_cnt: 0
  24 _______________    era_cnt: 0
  25 _______________    era_cnt: 0
  26 _______________    era_cnt: 0
  27 _______________    era_cnt: 0
  28 _______________    era_cnt: 0
  29 _______________    era_cnt: 0
  30 _______________    era_cnt: 0
  31 _______________    era_cnt: 0
era_cnt_max: 1
last_errno:  0
blocks:      32
free_blocks: 24
page_alloc:  102
page_delet:  15
used:        25602 of 113201

-- files:
 server.key.pem: 916 0xfcbbff51
 conf1.json: 803 0x1b18434d
cannot read sys_config_schema.json, err: -10004

rojer, Mar 16 '17

sorry, that's a lot of output. the important bit is at the very end: sys_config_schema.json could not be read.

rojer, Mar 16 '17

Hmm.. If you run a SPIFFS_check, is it recovered? Guess not, as you get ERR_DELETED, but still.. Also, do you have a code example of how to reproduce this? I will probably need to add interrupted write capability to the test bench if I haven't done it already.

pellepl, Mar 16 '17

see attached

stupid_github.zip

[rojer@nba ~/cesanta/qwe master]$ tar xjf sptest.tar.bz2                                                                                                                                                
[rojer@nba ~/cesanta/qwe master]$ cd spiffs/tools/                                                                                                                                                                                            
[rojer@nba ~/cesanta/qwe/spiffs/tools master]$ make ./spls ./sptest SPIFFS_CONFIG_PATH=.
gcc -I. -I. -I/home/rojer/cesanta/qwe/spiffs/tools/.. -o spls spls.c mem_spiffs.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_cache.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_gc.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_nucleus.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_check.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_hydrogen.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_vfs.c -Wall -Werror -DNO_ESP -DSPIFFS_TEST_VISUALISATION -g
gcc -I. -I. -I/home/rojer/cesanta/qwe/spiffs/tools/.. -o sptest sptest.c mem_spiffs.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_cache.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_gc.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_nucleus.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_check.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_hydrogen.c /home/rojer/cesanta/qwe/spiffs/tools/../spiffs_vfs.c -Wall -Werror -DNO_ESP -DSPIFFS_TEST_VISUALISATION -g
[rojer@nba ~/cesanta/qwe/spiffs/tools master]$ ./sptest 0xdb000.bin 38 && ./spls boom.bin 0xc9c55669
=== Will fail at write #38
...
-- files:
 server.key.pem: 916 0x055a86e7
 conf1.json: 803 0x1b18434d
cannot read sys_config_schema.json, err: -10004

rojer, Mar 16 '17

Hi @rojer, looked into it a little. The fact that an untouched file becomes corrupt during an aborted GC is really no mystery, as all files in the block being cleaned are in fact pretty brutally touched during the gc. As for the error, a SPIFFS_check seems to fix it, if I am not mistaken.

pellepl, Mar 16 '17

so do you then recommend that SPIFFS_check be run on every mount?

rojer, Mar 16 '17

the practical upshot of it is that files just go missing and the device gets bricked... not ideal :)

rojer, Mar 16 '17

No, that's not very ideal, I agree. There is some info on it here: https://github.com/pellepl/spiffs/wiki/FAQ#powerlosses-contd-when-should-i-run-spiffs_check

I do not know what system or application you're running it on, but if there is a way to detect a power loss in the middle of a file operation, then the check should be invoked. Otherwise, yes, the check should run on cold start.

pellepl, Mar 16 '17
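As a point of reference, this is roughly what "run the check on cold start" looks like with the public API; a minimal sketch with assumed buffer sizes (tune them to your spiffs config), not code from this thread:

#include "spiffs.h"

/* buffer sizes below assume a 256-byte logical page; adjust to your config */
static spiffs fs;
static u8_t work_buf[2 * 256];
static u8_t fd_buf[32 * 4];
static u8_t cache_buf[(256 + 32) * 4];

/* mount and, since a clean boot cannot be told apart from a power loss,
 * always run SPIFFS_check before the application touches any files.
 * cfg must already hold the HAL callbacks and flash geometry. */
s32_t fs_init(spiffs_config *cfg) {
  s32_t res = SPIFFS_mount(&fs, cfg, work_buf, fd_buf, sizeof(fd_buf),
                           cache_buf, sizeof(cache_buf), NULL);
  if (res != SPIFFS_OK) return SPIFFS_errno(&fs);

  /* SPIFFS_check scans the whole fs and mends what it can; its cost grows
   * with fs size, file count and the amount of damage found. */
  res = SPIFFS_check(&fs);
  if (res != SPIFFS_OK) return SPIFFS_errno(&fs);
  return SPIFFS_OK;
}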

I'm coming at this from the nodemcu project -- I think that we should add a spiffs-check on each restart (as we have no idea when the power went away). I might try adding some 'forced poweroff' logic to my stress tester (in PR https://github.com/pellepl/spiffs/pull/100 ) to make sure that it works.

pjsg, Mar 16 '17

yeah, same here. there's no way to tell, in general. so: any chance that GC can be modified to not require a check? also, how expensive is a check? doing it on every boot may be costly, considering most boots are normal.

rojer, Mar 16 '17

The GC cannot be modified to not require a check, as it is now. Unfortunately, a check can be very expensive: it depends on how big the fs is and how many files it contains, but mostly on whether it needs to mend anything.

I don't think a check on every reboot is necessary, though. Normally, embedded systems distinguish controlled reboots from crashes by writing some magic to RAM that is not cleared on startup (i.e. not in the .bss region). One could reserve some RAM for another magic which is updated at completion of an fs operation and destroyed at the start of an fs operation.

I also have some ideas for spiffs that would make the whole check procedure a lot smaller, basically by introducing a journal page. That way there is no need for a grand full-system healing, only a more directed one when needed. There are also plans to rewrite the block handling and introduce indirect blocks, so that metadata not residing in the cleaned block does not have to be updated during a gc. This would lead to a more effective gc and would, ultimately, make an interrupted gc not need a check at all. Unfortunately, I'm pretty tied up with work and life right now, and probably will be for a while.

pellepl, Mar 16 '17
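A sketch of the "magic in RAM that survives reset" idea described above. The .noinit section name and the GCC attribute are platform and linker-script specific, and the helper names are invented here; none of this is spiffs API:

#include <stdint.h>

#define FS_CLEAN_MAGIC 0xC0FFEE42u

/* placed in a RAM section that startup code neither zeroes nor initializes
 * (i.e. not .bss and not .data); the section name depends on the linker script */
static uint32_t fs_op_marker __attribute__((section(".noinit")));

/* destroy the magic before any fs-mutating operation, restore it on success */
static void fs_op_begin(void) { fs_op_marker = 0; }
static void fs_op_end(void)   { fs_op_marker = FS_CLEAN_MAGIC; }

/* on boot: if the magic is intact, the last fs operation completed and the
 * check can be skipped; after a real power cycle the noinit RAM holds garbage,
 * so this errs on the side of running SPIFFS_check. */
static int fs_needs_check(void) {
  return fs_op_marker != FS_CLEAN_MAGIC;
}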

as providers of middleware, we unfortunately cannot assume the existence of RAM that survives resets and power loss. i think i'll have to go with SPIFFS_check by default, for now.

rojer, Mar 16 '17

this didn't go down very well. looks like SPIFFS_check has some sort of bug, because with the change above we started getting pretty frequent corruptions, even on planned and scheduled reboots where we SPIFFS_unmount the fs first, i.e. where the check should have been a no-op. the corruptions are different in nature from the ones caused by GC: preexisting files do not disappear, but the file that was written just before the reboot gets corrupted, its content gets scrambled (with no read errors). neither the read nor the write cache is used. i'll look more into that, and probably file a separate issue.

rojer, Mar 21 '17

Sounds bad; having a check corrupting things is a tad counterintuitive, to say the least. Yes, please file it as another issue. All info you can collect is valuable. Cheers / P

pellepl, Mar 21 '17

ok, let me close the loop on this last one: this was a false alarm, something else was wrong with the setup. i no longer suspect SPIFFS_check.

rojer, Apr 04 '17

Good to hear. Did you find out what the problem was?

pellepl, Apr 23 '17