[Bug] Kyuubi Spark authorization plugin with Iceberg tables: Permission denied on Iceberg snapshot retrieval

Open elisabetao opened this issue 2 years ago • 4 comments

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct.

Search before asking

  • [X] I have searched in the issues and found no similar issues.

Describe the bug

When using Ranger Hive policies as the source for the Kyuubi Spark authorization plugin with Iceberg tables, we get "Permission denied" when retrieving data for a specific Iceberg snapshot, for example: select * from iceberg.test.customers.snapshot_id_7801393477815178085. Although the corresponding account has select and read rights on the test database in Ranger, the query fails with the following error:

An error occurred while calling o165.toJavaRDD.
: org.apache.kyuubi.plugin.spark.authz.AccessControlException: Permission denied: user [svc_df_big-st] does not have [select] privilege on [test.customers/snapshot_id_7801393477815178085/id]
	at org.apache.kyuubi.plugin.spark.authz.ranger.SparkRangerAdminPlugin$.verify(SparkRangerAdminPlugin.scala:172)
	at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5(RuleAuthorization.scala:93)
	at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5$adapted(RuleAuthorization.scala:92)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.org$apache$kyuubi$plugin$spark$authz$ranger$RuleAuthorization$$checkPrivileges(RuleAuthorization.scala:92)
	at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:37)
	at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:33)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:91)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:125)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:183)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:183)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:121)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:117)
	at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:135)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:153)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:150)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:172)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:171)
	at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3247)
	at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3245)
	at org.apache.spark.sql.Dataset.toJavaRDD(Dataset.scala:3257)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Traceback (most recent call last):
  File "/srv/ssd1/yarn/nm/usercache/svc_df_big-st/appcache/application_1701151368547_109009/container_e381_1701151368547_109009_01_000001/pyspark.zip/pyspark/sql/dataframe.py", line 117, in toJSON
    return RDD(rdd.toJavaRDD(), self._sc, UTF8Deserializer(use_unicode))
  File "/srv/ssd1/yarn/nm/usercache/svc_df_big-st/appcache/application_1701151368547_109009/container_e381_1701151368547_109009_01_000001/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/srv/ssd1/yarn/nm/usercache/svc_df_big-st/appcache/application_1701151368547_109009/container_e381_1701151368547_109009_01_000001/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/srv/ssd1/yarn/nm/usercache/svc_df_big-st/appcache/application_1701151368547_109009/container_e381_1701151368547_109009_01_000001/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o165.toJavaRDD.
: org.apache.kyuubi.plugin.spark.authz.AccessControlException: Permission denied: user [svc_df_big-st] does not have [select] privilege on [test.customers/snapshot_id_7801393477815178085/id]
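
For reference, a minimal PySpark sketch of the failing call path (the traceback shows the error surfacing from toJSON, which invokes toJavaRDD). It assumes a SparkSession configured as described under "Additional context" below, with an Iceberg catalog named iceberg; the snapshot ID is the one from this report:

    # Sketch only: reproduce the failing call path from the traceback above.
    df = spark.sql(
        "select * from iceberg.test.customers.snapshot_id_7801393477815178085"
    )
    # The authz rule runs while the optimized plan is built (see the
    # RuleAuthorization frames above), so the AccessControlException is raised
    # here, when toJSON() forces plan execution, not at sql() time.
    df.toJSON().take(1)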

However, if the test account is granted Hive read access to all databases, there is no permission issue; read access on all (*) databases should not normally be necessary for this query to be allowed. Is there a bug in the Kyuubi Spark authorization plugin causing this? The patch at https://github.com/apache/kyuubi/pull/3931/files does not seem to cover this scenario.

Thanks

Affects Version(s)

1.8.0

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

No response

Kyuubi Server Configurations

No response

Kyuubi Engine Configurations

No response

Additional context

We are using the Kyuubi Spark Authorization Plugin with Spark 3.2 and Iceberg 1.0.0.1.3.1, as described here: https://kyuubi.readthedocs.io/en/master/security/authorization/spark/install.html
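
For completeness, a sketch of how such a session is wired up. The RangerSparkExtension class comes from the linked install guide; the Iceberg extension, catalog name, and Hive-backed catalog type are assumptions matching the queries in this report:

    # Sketch only: SparkSession configuration for the Kyuubi authz plugin plus
    # an Iceberg catalog named "iceberg" (catalog settings are assumptions).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config(
            "spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension",
        )
        .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.iceberg.type", "hive")  # assumption
        .getOrCreate()
    )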

Are you willing to submit PR?

  • [ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • [ ] No. I cannot submit a PR at this time.

elisabetao · Dec 01 '23 18:12

Can you provide the plan details?

yaooqinn · Dec 05 '23 11:12

Hello, please let me know if more details are needed. This is also after applying patch #5248 (https://github.com/apache/kyuubi/commit/724ae93989e7e64f858b5c621ef28e9b17e45f99), which appears to alleviate the access issue for iceberg.test.customers.snapshot_id_X but introduces another issue: metadata such as snapshots and history becomes freely accessible without any Ranger security checks. The physical plan is:

    == Physical Plan ==
    *(1) Project [id#32, name#33, age#34, address#35, cloth#36]
    +- BatchScan[id#32, name#33, age#34, address#35, cloth#36] iceberg.gns_test.customers [filters=] RuntimeFilters: []
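
For example, reads of Iceberg's built-in metadata tables (snapshots and history are standard Iceberg metadata tables) now appear to bypass Ranger entirely; a sketch using the table from this report:

    # Sketch only: after patch #5248, these metadata-table reads succeed in our
    # environment even though no Ranger policy explicitly allows them.
    spark.sql("select * from iceberg.test.customers.snapshots").show()
    spark.sql("select * from iceberg.test.customers.history").show()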

Thanks a lot

elisabetao · Dec 06 '23 19:12

thanks @elisabetao, we need the full plan

yaooqinn · Dec 07 '23 01:12