segmentation fault on latest version
Hi,
We hit a segmentation fault again with the latest beansdb version. All the data was generated by the new version. The crash happens on beansdb instances where we are deleting old data (if we never delete data, it does not crash).
Some possible error log output from beansdb-error.log:

```
2015-09-07 07:07:30.302819 ERROR (0x1346c700:record.c:184) - invalid ksz=0, vsz=0, wbuf @891486464, key = (-896948590465773564)
2015-09-07 07:07:30.302990 ERROR (0x1346c700:record.c:380) - read file fail, /data/running/beansdb/storage/f/1/043.data @891486464, file size = 654926336, key = -896948590465773564
2015-09-07 07:07:30.303022 ERROR (0x1346c700:bitcask.c:1099) - Bug: get -896948590465773564 failed in /data/running/beansdb/storage/f/1/043.data @ 891486464
```
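For context, the `invalid ksz=0, vsz=0` message comes from a header sanity check: a record whose key/value sizes are zero (for example, bytes read past the end of valid data or from a torn write) is rejected before the body is parsed. A minimal sketch in Go of this kind of check, assuming a bitcask-style record header; the field layout, limits, and function names here are illustrative, not beansdb's actual on-disk format:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// recordHeader holds just the two size fields relevant to the error
// message. Beansdb's real header also carries a crc, timestamp, etc.;
// this layout is an assumption for illustration.
type recordHeader struct {
	KSz uint32 // key size in bytes
	VSz uint32 // value size in bytes
}

const (
	maxKeySize   = 250     // memcached-style key limit (assumption)
	maxValueSize = 50 << 20 // arbitrary sanity cap (assumption)
)

// validateHeader mimics the kind of check that logs
// "invalid ksz=0, vsz=0": a zeroed or implausible header is
// rejected before any further reading is attempted.
func validateHeader(h recordHeader) error {
	if h.KSz == 0 || h.KSz > maxKeySize {
		return fmt.Errorf("invalid ksz=%d", h.KSz)
	}
	if h.VSz > maxValueSize {
		return fmt.Errorf("invalid vsz=%d", h.VSz)
	}
	return nil
}

func main() {
	// A zeroed buffer, as you might get when reading past the end
	// of a data file, decodes to ksz=0, vsz=0 and is rejected.
	buf := make([]byte, 8)
	h := recordHeader{
		KSz: binary.LittleEndian.Uint32(buf[0:4]),
		VSz: binary.LittleEndian.Uint32(buf[4:8]),
	}
	fmt.Println(validateHeader(h)) // prints "invalid ksz=0"
}
```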
And the gdb backtrace (the `<...>` annotations in frames #0, #5, and #8 were eaten by the issue renderer; restored here from gdb's usual output):

```
(gdb) bt
#0  decode_record (buf=0x7fd4ef3a7f40 <Address 0x7fd4ef3a7f40 out of bounds>,
    size=4043926784, decomp=true, path=0x42b242 "wbuf", pos=884000768,
    key=0x7fd3c434a864 "5335789194110384689", do_logging=true, fail_reason=0x0)
    at record.c:180
#1  0x0000000000413cc9 in bc_get (bc=0x18a7b80,
    key=0x7fd3c434a864 "5335789194110384689", ret_pos=<value optimized out>,
    return_deleted=<value optimized out>) at bitcask.c:1006
#2  0x00000000004179f0 in hs_get (store=0x11d1570,
    key=0x7fd3c434a864 "5335789194110384689", vlen=0x7fd51526e51c,
    flag=0x7fd51526e518) at hstore.c:397
#3  0x00000000004084ea in item_get (key=0x7fd3c434a864 "5335789194110384689",
    nkey=19) at item.c:221
#4  0x0000000000405fc3 in process_get_command (c=0x7fd4983b4ce0,
    command=<value optimized out>) at beansdb.c:918
#5  process_command (c=0x7fd4983b4ce0, command=<value optimized out>)
    at beansdb.c:1214
#6  0x00000000004073e4 in try_read_command (c=0x7fd4983b4ce0) at beansdb.c:1362
#7  drive_machine (c=0x7fd4983b4ce0) at beansdb.c:1590
#8  0x0000000000408b42 in worker_main (arg=<value optimized out>)
    at thread.c:218
#9  0x00000033a08079d1 in start_thread () from /lib64/libpthread.so.0
#10 0x00000033a00e88fd in clone () from /lib64/libc.so.6
```
Hope to get your response soon :)
@xuyin224 Thanks for your feedback. Can you provide us with more info? e.g.
- Full details of your operating system (or distribution), e.g. 64-bit Ubuntu 14.04.
- A small test case, if applicable, that demonstrates the issue.
- How did you delete your data? `mc.delete`?
'This crash happens on beansdb that we are deleting old data (if we never delete data, it will not crash)'
Remember the golden rule of bug reports: The easier you make it for us to reproduce the problem, the faster it will get fixed.
Kernel 2.6.32-431.20.3.el6.x86_64, CentOS 6.5. Deletion uses your golang library, via Delete. It is a daily operation that runs every day at 7:00 am: it deletes data older than 60 days, and then calls flush_all with a time parameter. This happens in a live environment; the set/get pattern is around 50 set/s and 1000 get/s at peak hour. Under such load, it crashed less than one week after we started the deletes. The server is equipped with 3.2T of FusionIO SSD storage.
We only allocate one bucket to the server, with a 2-layer folder layout and a max file size of 1 GB; total data was around 600 when it crashed. The start command is:

```
/usr/local/bin/beansdb -p 19900 -c 2048 -t 32 -T 2 -F 1024 -H storage -v 1
```
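The daily cleanup described above boils down to two memcached-text-protocol commands: a `delete` per expired key, then one `flush_all` with a cutoff timestamp. A minimal sketch of the command framing, with the caveat that the helper names are ours (not the golang client's API), and that the time argument's cutoff semantics are beansdb-specific as described in this thread (stock memcached treats `flush_all`'s argument as a delay):

```go
package main

import "fmt"

// deleteCmd frames a memcached text-protocol delete for one key.
func deleteCmd(key string) string {
	return fmt.Sprintf("delete %s\r\n", key)
}

// flushCmd frames flush_all with a unix-timestamp argument, which
// this thread uses as "drop/GC data older than this time".
func flushCmd(beforeUnix int64) string {
	return fmt.Sprintf("flush_all %d\r\n", beforeUnix)
}

func main() {
	// Daily 07:00 job: delete keys set more than 60 days ago,
	// then flush_all with the same cutoff T.
	cutoff := int64(1441584000) // example cutoff timestamp
	fmt.Print(deleteCmd("5335789194110384689"))
	fmt.Print(flushCmd(cutoff))
}
```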
Update on the issue: our server crashed again today. More seriously, it crashed while doing flush_all, and beansdb could not restart successfully, failing with the following error:

```
2015-09-09 08:04:51.112419 NOTICE (0x55b7f700:beansdb.c:2177) - ZLOG inited
2015-09-09 08:04:51.148101 FATAL (0x55b7f700:diskmgr.c:47) - basename not match /data/running/beansdb/storage/f/3/027.data->/data/running/beansdb/storage/f/3/028.data
2015-09-09 08:04:51.148129 FATAL (0x55b7f700:bitcask.c:309) - find bad symlink /data/running/beansdb/storage/f/3/027.data->/data/running/beansdb/storage/f/3/028.data, type = 0, bucket = -1
2015-09-09 08:04:51.148142 FATAL (0x55b7f700:bitcask.c:439) - bitcask 0xf3 check failed, exit!
```
Make sure manually that the basenames of the link source and target are the same; otherwise beansdb cannot start.
Yes, we manually restarted it successfully. But I hope beansdb can handle this automatically; that is, beansdb should recover automatically from a crash during flush_all. Anyway, this is less urgent; the crash itself is more urgent. Take your time to solve the bug, and let me know if there are any issues. We have stopped the delete operation for now, but we eventually need to delete. We are running beansdb on FusionIO SSD as a kind of cache.
- The 027.data -> 028.data link may happen only when GC (optimize) runs on the last data file.
- This link check on start is mainly for multi-disk setups. If it bothers you, for now we suggest running a script to delete such links before restarting beansdb (if you can identify them).
We haven't seen a crash like this before, so we may need more information.
- What op leads to the crash? By "delete", do you mean GC (flush_all) or deleting keys (through an mc client and beanseye)? What is your delete and GC period/pattern?
- Does the crash happen right after you start the op, or after a random amount of time?
- Does the error log of each crash look like the first one? If not, the more logs the better.
- Delete means: we first delete keys whose set time is older than T with the client, then call flush_all with time parameter T. In this case, we start the delete script at 7 am every day.
- We ran the delete for 5 days. In the first three days, no crash happened. On the fourth day, the crash happened some minutes after flush_all finished; on the fifth day, the crash happened during flush_all. We suspect that accessing certain keys leads to a pointer into invalid memory.
- There are many errors identical to the first one, but I think that error log may not be the one causing the crash; the crash may leave no error log at all. You can check the gdb trace in the first post for the details of the crash.