accumulo-testing icon indicating copy to clipboard operation
accumulo-testing copied to clipboard

Security.xml throws exception with randomwalk

Open DomGarguilo opened this issue 3 years ago • 7 comments

When running ./bin/rwalk Security.xml, or any other module that uses the security module, and exception will be thrown whenever the AuthenticateSystemWrong node is run.

2022-09-08T16:48:59,788 [testing.randomwalk.Framework] ERROR: Error during random walk
java.lang.Exception: Error running node AuthenticateSystemWrong
	at org.apache.accumulo.testing.randomwalk.Module.visit(Module.java:369) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.testing.randomwalk.Framework.run(Framework.java:54) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.testing.randomwalk.Framework.main(Framework.java:98) ~[accumulo-testing-shaded.jar:?]
Caused by: org.apache.accumulo.core.client.AccumuloException: Authentication succeeded when it should have failed while the user exists? true
	at org.apache.accumulo.testing.randomwalk.security.Authenticate.authenticate(Authenticate.java:101) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.testing.randomwalk.security.Authenticate.visit(Authenticate.java:40) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.testing.randomwalk.Module.lambda$visit$0(Module.java:301) ~[accumulo-testing-shaded.jar:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-testing-shaded.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-testing-shaded.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-testing-shaded.jar:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
2022-09-08T16:48:59,801 [testing.randomwalk.Framework] INFO : Test finished

There is a log "Test finished" afterwards but it just hangs at this point and blocks.

This exception is thrown when the following case is statisfied:

https://github.com/apache/accumulo-testing/blob/ae9c46552a661c2dec132d26835cec5dd2ecbf66/src/main/java/org/apache/accumulo/testing/randomwalk/security/Authenticate.java#L83-L86

I'm not sure if there is an error in the logic that throws this exception or if there is a valid error occurring here.

DomGarguilo avatar Sep 08 '22 21:09 DomGarguilo

I have not looked at this in a long time, so not completely sure the following explanation is completely correct.

I think the security RW test are broken because the assume that the security APIs are immediately consistent when they are actually eventually consistent. The security RW test may work better if run w/ a single tablet server but not sure.

I think when an Accumulo client does a security operation it picks a random tablet server to actually carry out that work. The tablet servers store the security information in Zookeeper and use ZooCache to keep a copy of it. So when one tserver does a security operation it will eventually cause the zoocaches of other tablet servers to update. That could lead to the following

  1. Client A randomly contacts tserver 1 to update some security information
  2. Client A randomly contacts tserver 2 to read the security information it just updated. Because ZK has not propagated the update yet to tserver 2, Client A does not see the update.

Again not sure if my recollection matches the current code.

keith-turner avatar Sep 09 '22 14:09 keith-turner

The security RW test may work better if run w/ a single tablet server but not sure.

When running with a single tserver it seems like this exception gets thrown just as often.

DomGarguilo avatar Sep 09 '22 15:09 DomGarguilo

I see some inconsistency when running the security module. At times my error is just what you described above. Other times I get a different error followed by the one above. And rarely, I get a completely different error. I will post the logs below.

This one I got followed then by the one posted above. This usually happened on a second run through and it seemed the module got further along
2022-09-10T14:44:36,497 [randomwalk.security.TableOp] INFO : Caught MutationsRejectedException(PERMISSION_DENIED) in TableOp.WRITE as expected.
2022-09-10T14:44:36,497 [core.clientImpl.TabletServerBatchWriter] ERROR: Failed to send tablet server groot:9997 its batch : Error PERMISSION_DENIED for user table_groot - User does not have permission to perform this action
org.apache.accumulo.core.client.AccumuloSecurityException: Error PERMISSION_DENIED for user table_groot - User does not have permission to perform this action
	at org.apache.accumulo.core.clientImpl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:1015) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.core.clientImpl.TabletServerBatchWriter$MutationWriter$SendTask.send(TabletServerBatchWriter.java:858) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.core.clientImpl.TabletServerBatchWriter$MutationWriter$SendTask.run(TabletServerBatchWriter.java:816) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-testing-shaded.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-testing-shaded.jar:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.accumulo.core.clientImpl.thrift.ThriftSecurityException
	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$update_result$update_resultStandardScheme.read(TabletClientService.java:9080) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$update_result$update_resultStandardScheme.read(TabletClientService.java:9066) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$update_result.read(TabletClientService.java:9000) ~[accumulo-testing-shaded.jar:?]
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:88) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_update(TabletClientService.java:280) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.update(TabletClientService.java:263) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.core.clientImpl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:931) ~[accumulo-testing-shaded.jar:?]
	... 7 more


This one I got rarely and by itself.
22-09-10T14:46:32,567 [testing.randomwalk.Framework] ERROR: Error during random walk
java.lang.Exception: Error running node security.TableOp
	at org.apache.accumulo.testing.randomwalk.Module.visit(Module.java:364) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.testing.randomwalk.Framework.run(Framework.java:54) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.testing.randomwalk.Framework.main(Framework.java:98) ~[accumulo-testing-shaded.jar:?]
Caused by: java.lang.UnsupportedOperationException: GET_SUMMARIES not implemented
	at org.apache.accumulo.testing.randomwalk.security.TableOp.visit(TableOp.java:296) ~[accumulo-testing-shaded.jar:?]
	at org.apache.accumulo.testing.randomwalk.Module.lambda$visit$0(Module.java:299) ~[accumulo-testing-shaded.jar:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-testing-shaded.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-testing-shaded.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-testing-shaded.jar:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
2022-09-10T14:46:32,577 [testing.randomwalk.Framework] INFO : Test finished

Manno15 avatar Sep 10 '22 18:09 Manno15

More testing needs to be done, but my current guess is the initial error is caused by the user not being dropped as it is supposed to be before the authentication (as seen in the xml). This happens in the NoUserNoTable node (the first large node). Though the error isn't very clear there.

Manno15 avatar Sep 11 '22 18:09 Manno15

I think the security RW test are broken

I agree. I dropped the test from the other modules in https://github.com/apache/accumulo-testing/pull/221

milleruntime avatar Sep 14 '22 16:09 milleruntime

I narrowed down the line of code that causes the module to fail. It fails on the expected authenticate fail case which all it does is increment the password passed into authenticateUser which should cause the check to fail. Though at times once it gets to https://github.com/apache/accumulo/blob/cf57e343de77f416de10c22e0ed05ef37f433b36/server/base/src/main/java/org/apache/accumulo/server/security/handler/ZKAuthenticator.java#L231, which verifies the cryptPass, it passes.

That function compares the crypt hash with the new password token that is passed in from the module. Based on the password incrementation, it should return false however, it seems a decent amount of times it returns true instead. Here: https://github.com/apache/accumulo/blob/cf57e343de77f416de10c22e0ed05ef37f433b36/server/base/src/main/java/org/apache/accumulo/server/security/handler/ZKSecurityTool.java#L134-L136. If this returns true, then the entire module returns true when it is expected to be false.

Not quite sure why this is occurring since it is seemingly random on when the module fails or when it passes but is consistent on which point it fails at.

Manno15 avatar Oct 02 '22 17:10 Manno15

This can happen just as often with a single tserver because the manager can also be selected to perform the security operation, so the tserver is still only eventually consistent with the ZooKeeper quorum that was updated by the manager.

ctubbsii avatar Oct 03 '22 06:10 ctubbsii