rasterframes Column name conflict when using raster_join on data that was already raster joined.

Py4JJavaError: An error occurred while calling o59.rasterJoin.
: org.apache.spark.sql.AnalysisException: Reference 'spatial_index_agg' is ambiguous, could be: spatial_index_agg, spatial_index_agg.;

Easy to reproduce: just raster_join 3 rasters in sequence. The second join raises the error above. The current workaround is to call df.drop('spatial_index_agg') before the join.
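A minimal sketch of that workaround, folded over any number of frames. The helper name `chain_raster_joins` is hypothetical and not part of rasterframes. It assumes each frame exposes `columns`, `drop` and a `raster_join` method (as pyrasterframes adds to Spark DataFrames), and that the leftover column sits on the already-joined (left) side, which the report does not state explicitly.

```python
from functools import reduce


def chain_raster_joins(frames, helper_col="spatial_index_agg"):
    """Raster-join the frames left to right, dropping helper_col between joins.

    Hypothetical helper: `frames` is an iterable of DataFrame-like objects
    that provide `.columns`, `.drop(name)` and `.raster_join(other)`.
    """
    def join_step(left, right):
        # Assumption: the conflicting column is left over from the previous
        # join, so it is removed from the left side before joining again.
        if helper_col in left.columns:
            left = left.drop(helper_col)
        return left.raster_join(right)

    return reduce(join_step, frames)
```

Usage would look like `joined = chain_raster_joins([df1, df2, df3])`. Note that this discards the aggregated index from every intermediate join, so it only fits cases where that column is not needed downstream.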

Feb 14 '20 08:02 mjgolebiewski

@mjgolebiewski What would you expect the automatic behavior to be? Do you think random characters should be added? Some other mechanism?

Note: Any columns in the RHS dataframe are going to be propagated to the joined data frame as lists.

Feb 14 '20 18:02 metasim

If not random characters, then maybe something related to the names of the joined DataFrames? I am still exploring raster_join and its outputs, so I'm not sure.

Feb 18 '20 12:02 mjgolebiewski

@mjgolebiewski What do you mean by "joined dataframes names"? If you mean the name of the variables referencing them, then there's no way to get that information from within raster_join. My suspicion is that the behavior is typical Spark behavior, in that you have to take care of renaming columns before joins to keep them unique.

Feb 18 '20 14:02 metasim

From a pandas user's perspective, and also from experience with R's data.frame, I would expect either:

1. All column names get a distinguishing suffix indicating the side of the join they came from: ('_left', '_right') or ('_x', '_y'). These suffixes could be an argument to the join method.
2. Only column names that appear in both DataFrames are disambiguated, by appending a suffix in the same fashion.
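For illustration, the "only shared names get a suffix" behavior can be sketched on plain lists of column names. This is roughly what the `suffixes` argument of pandas `DataFrame.merge` does (its default is `('_x', '_y')`). The function name `disambiguate` is hypothetical and is not an existing rasterframes or Spark API.

```python
def disambiguate(left_cols, right_cols, suffixes=("_left", "_right")):
    """Return renamed (left, right) column lists.

    Only names present on both sides receive a suffix; all other names
    pass through unchanged. Hypothetical helper for illustration only.
    """
    left_suffix, right_suffix = suffixes
    shared = set(left_cols) & set(right_cols)
    new_left = [c + left_suffix if c in shared else c for c in left_cols]
    new_right = [c + right_suffix if c in shared else c for c in right_cols]
    return new_left, new_right
```

In Spark, the resulting names could then be applied with `withColumnRenamed` on each side before the join, which matches the "rename before joining" practice mentioned earlier in the thread.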

Feb 18 '20 20:02 vpipkt