Number of WAL files for a tserver can become large
I've been experiencing a case where TabletServers hold on to many more WAL files than normal, seemingly without bound, until they are restarted. As a byproduct, this causes extremely long recovery periods after the restart. Affected TabletServers have hundreds or thousands of WAL entries in ZooKeeper instead of the normal 3-5.
Accumulo Version 1.9.3
@adamjshook And these don't go away after flush? Is the accumulo-gc running? Probably not related (since it sounds like these WALs were still in use), but we saw (and fixed) a problem with the accumulo-gc having a memory leak that could cause it to get slower and slower cleaning up old files/logs in #1314 . The fix will be in 1.9.4, but it's a trivial backport to 1.9.3.
Actually, this sounds related to #854 / #860 / #1008 ; not sure if the fix in 8b6aaa57126ed1f1fe6e8485c7bd559ff64f9b54 would be easy to backport to 1.9.
The GC is running on its typical 5-minute interval and reporting a large number of candidates, so it seems to be in working order. We have an aggressive 5-minute max age, the default 1 GB max size, and table.compaction.minor.logs.threshold also at its default of 3. I can manually flush all tables and see if that helps clean things up. The issue has been present for a little over two weeks now; I just noticed it today.
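For reference, these settings can be inspected, and a manual flush kicked off, from the Accumulo shell. This is a sketch assuming a 1.9-era shell; `mytable` is a placeholder table name, and `-f` filters the property listing:

```shell
# Inside the Accumulo shell: inspect the WAL-related settings discussed above
config -f tserver.walog.max.age
config -f tserver.walog.max.size
config -f table.compaction.minor.logs.threshold

# Force a minor compaction (flush) of a table; -w waits for it to finish
flush -t mytable -w
```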
Would the link commit help? If we already have table.compaction.minor.logs.threshold set to 3, this should work even without the new property tserver.walog.max.referenced, no?
On second thought, probably not, unless you have it set to different values for different tables. If you're running 1.9.3, you should have the commit from #860, which attempts to mitigate this a little bit by forcing some tablets to flush... but I don't think it solves it completely, especially if you have many many tablets, and they're all actively ingesting.
Manually flushing all tables did not help. I was doing some digging and noticed that, for each TServer with many WAL file entries in ZooKeeper, the Tablet Server Status page in the Monitor is reporting tablets for a table that no longer exists, e.g. (ID:134) instead of the normal table name. TServers that are operating normally do not have any tablets like this. I think this symptom correlates with when I deleted table ID 134. Maybe there is something there?
Edit: Digging through logs and metrics -- at the time the WAL candidates started to climb, I did delete table 134.
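For anyone watching this climb in their own cluster, the WAL markers can be counted directly in ZooKeeper. This is a sketch assuming the 1.9-style layout under `/accumulo/<instance-id>/wals`; the ZooKeeper host, instance id, and tserver address are all placeholders:

```shell
# List WAL markers held for one tserver (1.9-style ZooKeeper layout)
zkCli.sh -server zoo1:2181 ls /accumulo/<instance-id>/wals/tserver.example.com:9997
```

A healthy tserver should show a handful of children here; hundreds or thousands matches the symptom described above.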
Strange. Is there any chance the FATE operation for the table 134 is still running, or was it aborted? Any hints in the master log? Any entries for that table in accumulo.metadata?
Nothing in accumulo.metadata. Nothing currently listed in the FATE list. From the master logs it seemed to be deleted okay. If memory serves, I had to abort a FATE transaction for a bulk import on this table that had extremely large rows; potentially millions of entries with the same row ID. We ended up not using the table; the bulk import was executed for testing but it never went to production. I finally dropped it a couple weeks ago.
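For anyone hitting this later, "nothing in accumulo.metadata" for a deleted table id can be verified with a metadata scan in the Accumulo shell. Table id 134 is the one from this issue; the `134;`/`134<` row range covers all of that table's tablet entries:

```shell
# Inside the Accumulo shell: look for leftover metadata entries for table id 134
scan -t accumulo.metadata -b 134; -e 134<
```

An empty result means no tablet entries (and thus no WAL references in the metadata table) remain for that id.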
Hmm. I wonder if whatever logic unlinks the WAL from the tablets when a tablet is flushed isn't being executed when the tablet is unloaded (for deletion or possibly also migration)?
Actually, for migrations, it should flush first... this might be specific to deletion, because there's no point in flushing for a table that is deleted. Can investigate more tomorrow.
@keith-turner , I noticed you self-assigned this about a month ago. Are you still looking into this? Have you made any progress?
Does this issue exist in any version later than the reported 1.9.3?
Yes, unfortunately it happened once in our prod environment, running Accumulo 2.1.3 / Hadoop 3.2.2, about two weeks ago. In our prod Accumulo 2.0.0 cluster it happens more often.