kbfs icon indicating copy to clipboard operation
kbfs copied to clipboard

keybase chat commands get slower and slower

Open mortenkjarulff opened this issue 7 years ago • 24 comments

Attached script do:

while true ; do

  sleep 15

  keybase chat send ...

  ls /keybase/team/...

  echo '{"method": "read", ... | keybase chat api ...

  if 6 hours has gone
    run_keybase
  fi

done

After a while, chat send and chat read gets slower and slower, going from less that 1 second to more than 60 seconds. sleep and ls times seems constant. When 6 hours has gone and the run_keybase is executed, times get low again (but build up over the next 6 hours).

If I outcomment the ls command, all times are constant low.

These logs are from the same run, WITH the ls command: a4f72ae29df5ba7e9a4c2a1c 0e3959537531bc60a74ac21c

These logs are from the same run, WITHOUT the ls command: 41c5342a84ccee599571831c b7cc0500989c6a67920c0a1c

/Morten

mortenkjarulff avatar Oct 31 '18 15:10 mortenkjarulff

test.sh.txt

mortenkjarulff avatar Oct 31 '18 15:10 mortenkjarulff

@mortenkjarulff did you mean to file this in keybase/client? Chat is pretty much completely separate from kbfs at this point.

strib avatar Oct 31 '18 16:10 strib

Hi,

Appears not to be so separate 😀

Shall I file it there instead?

/Morten

mortenkjarulff avatar Oct 31 '18 16:10 mortenkjarulff

Oh, I guess I glossed over or didn't understand the line about "outcomment the ls".

Ok, here's what appears to be the issue. The team folder you're lsing has been written to by a device that has since been revoked. When you ls it, that triggers kbfs to attempt to "garbage-collect" the old data in it, by iterating through the old revisions of the folder. When it looks at the revisions that were written by the now-revoked device, it triggers a bunch of extra checking to verify that the device was valid at the time of the revision.

It looks like, in the case of this folder, the revisions by that revoked device are so expensive that they are slowing down chat.

Can you just ls that folder once, and maybe let the garbage collection run overnight and see if it finishes? You can tell if it made progress overnight by looking at the LastGCRevision field in the team folder's .kbfs_status file.

If it doesn't make any progress, even after many hours of work, I'll try to figure out a custom solution for you.

And obviously we need to figure out a way to reduce these extra checks without affecting security...

strib avatar Oct 31 '18 16:10 strib

Not sure I understand. I can reproduce at any time, even long time after I revoked.

I will manually ls now, and start the script tomorrow.

mortenkjarulff avatar Oct 31 '18 18:10 mortenkjarulff

We only garbage-collect data when it is at least two weeks old, but it could be longer depending on when your devices were online and when they accessed this particular folder. In your case the device was revoked on October 3rd, but it's trying to garbage-collect data written on September 29th.

strib avatar Oct 31 '18 18:10 strib

Ok, will try

Log after manual ls a0cb1d208188c2886665501c

mortenkjarulff avatar Oct 31 '18 18:10 mortenkjarulff

LastGCRevision": 0,

Is that good?

mortenkjarulff avatar Oct 31 '18 18:10 mortenkjarulff

That means it has never been successfully garbage collected. If the team/folder is very old, this is surprising and might indicate some other bug.

strib avatar Oct 31 '18 18:10 strib

Timestamp on folder is October 8, that could well be the date I created the subteam

mortenkjarulff avatar Oct 31 '18 18:10 mortenkjarulff

Looks like it was created on September 20, so that lines up with the garbage collection and revocation messages I mentioned above.

strib avatar Oct 31 '18 19:10 strib

LastGCRevision": 0, d46b49510feabaac7be4161c

If it can help

mortenkjarulff avatar Oct 31 '18 20:10 mortenkjarulff

Please just let it run for a long time uninterrupted.

strib avatar Oct 31 '18 20:10 strib

Still 0

mortenkjarulff avatar Nov 01 '18 16:11 mortenkjarulff

Hrm that's a shame. I'm going to work today on making this whole thing better.

strib avatar Nov 01 '18 18:11 strib

I put up #1893 to help eliminate much of the redundant work we're doing for revoked devices. It's under review now, but if you're feeling bold and want to try it out before it's merged officially, let me know.

strib avatar Nov 01 '18 20:11 strib

When will it be official maybe?

If it can help, how can I test it?

Would it help me if I created a new subteam, copy over and rename?

/Morten

mortenkjarulff avatar Nov 02 '18 06:11 mortenkjarulff

It probably won't be out in the official release for a while, but I've made a .deb file for you, if you feel like trying it out:

/keybase/public/strib/keybase-2.10.0-20181102162247.e4a1860091-amd64.deb

or if you trust keybase.pub:

https://keybase.pub/strib/keybase-2.10.0-20181102162247.e4a1860091-amd64.deb

Note this is not a well-tested release, so there could be some rough edges. Also, there might be multiple issues going on with your folder, and I only fixed the most obvious first one, so this debugging might take more steps. We'll see!

If you feel like trying it, please install the above deb, run run_keybase, ls your folder again, and wait a bit. If you do a log send after a few minutes I can check to see if it seems to be making progress. Thanks!

strib avatar Nov 02 '18 16:11 strib

Installed your test version. Still 0. 921efd3e424dcbad89bf081c

/Morten

mortenkjarulff avatar Nov 04 '18 14:11 mortenkjarulff

Thanks. I was wrong to say that it starts the GC process after 1 minute -- we changed it a while back to one hour. It looks like you sent the log less than an hour after you did the ls. Can you please send another log now if it's been up this whole time?

strib avatar Nov 04 '18 15:11 strib

"LastGCRevision": 9, d96ca4acddd6f20834cd6d1c

It has been running for some hours.

/Morten

mortenkjarulff avatar Nov 04 '18 20:11 mortenkjarulff

Ok cool. It at least stopped chewing up your CPU right? I'll look into why it hasn't gone past 9 yet tomorrow, but that seems like a separate issue.

strib avatar Nov 04 '18 20:11 strib

Have been running my test script all day and send/read times are constant low.

I don't think the issue was CPU, but can be wrong.

Should I reinstall the official version again, or?

/ Morten

mortenkjarulff avatar Nov 05 '18 14:11 mortenkjarulff

Great!

I wouldn't recommend re-installing the master version, as the problem could come back. Just wait for the next apt release, which will include these changes. Though if you seem to be having unexpected problems with the release you have. let me know.

strib avatar Nov 05 '18 15:11 strib