Column name conflict when using raster_join on data that was already raster joined.
Py4JJavaError: An error occurred while calling o59.rasterJoin.
: org.apache.spark.sql.AnalysisException: Reference 'spatial_index_agg' is ambiguous, could be: spatial_index_agg, spatial_index_agg.;
Easy to reproduce: just try to raster_join 3 rasters. The error above is shown on the second join. The current workaround is to call df.drop('spatial_index_agg') before the join.
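For anyone hitting this, here is a minimal sketch of that workaround applied across a chain of joins. It is not part of RasterFrames: `chain_joins` and its `join` parameter are hypothetical names, and the helper only assumes that each frame exposes `.columns` and `.drop(name)`, as Spark DataFrames do.

```python
from functools import reduce

# Column named in the AnalysisException above.
AGG_COL = "spatial_index_agg"


def chain_joins(first, others, join, helper_col=AGG_COL):
    """Fold `join` over several frames, dropping `helper_col` from the
    accumulated result before each join so the name cannot clash.

    `join` is any callable taking (left, right) and returning a frame;
    it is a hypothetical parameter, not a RasterFrames API.
    """
    def step(acc, nxt):
        # Drop the leftover helper column from a previous join, if present.
        if helper_col in acc.columns:
            acc = acc.drop(helper_col)
        return join(acc, nxt)

    return reduce(step, others, first)
```

With real data this would presumably be called as `chain_joins(rf1, [rf2, rf3], lambda l, r: l.raster_join(r))`, where `rf1`, `rf2` and `rf3` are placeholder names for the three raster DataFrames.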
@mjgolebiewski What would you expect the automatic behavior to be? Do you think random characters should be added? Some other mechanism?
Note: Any columns in the RHS dataframe are going to be propagated to the joined data frame as lists.
If not random characters, then maybe something related to the joined dataframes' names? I am still exploring raster_join and its outputs, so I'm not sure.
@mjgolebiewski What do you mean by "joined dataframes names"? If you mean the name of the variables referencing them, then there's no way to get that information from within raster_join. My suspicion is that the behavior is typical Spark behavior, in that you have to take care of renaming columns before joins to keep them unique.
From a pandas user perspective and also experience with R data.frame, I would expect either:
- All column names are suffixed with a distinguishing string indicating the side of the join they came from: ('_left', '_right') or ('_x', '_y'). These strings may be an argument to the join method
- Only column names appearing in both DataFrames are disambiguated by appending a suffix in the same fashion
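To make the two options concrete, here is a rough sketch of the naming logic in plain Python. `disambiguate` and its parameters are hypothetical and not an existing raster_join option. It only computes the new column names and, as a simplification, does not treat join keys specially.

```python
def disambiguate(left_cols, right_cols, suffixes=("_x", "_y"), only_overlap=True):
    """Return (new_left, new_right) column name lists.

    only_overlap=True  -> suffix only names present on both sides.
    only_overlap=False -> suffix every column with its side's suffix.
    """
    overlap = set(left_cols) & set(right_cols)

    def rename(cols, suffix):
        return [
            c + suffix if (not only_overlap or c in overlap) else c
            for c in cols
        ]

    return rename(left_cols, suffixes[0]), rename(right_cols, suffixes[1])


left, right = disambiguate(
    ["tile_a", "spatial_index_agg"],
    ["tile_b", "spatial_index_agg"],
)
print(left)   # ['tile_a', 'spatial_index_agg_x']
print(right)  # ['tile_b', 'spatial_index_agg_y']
```

In Spark the resulting names could then be applied with `withColumnRenamed` on each side before the join.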