Column name conflict when using raster_join on data that was already raster joined.

Open mjgolebiewski opened this issue 5 years ago • 4 comments

Py4JJavaError: An error occurred while calling o59.rasterJoin.
: org.apache.spark.sql.AnalysisException: Reference 'spatial_index_agg' is ambiguous, could be: spatial_index_agg, spatial_index_agg.;

Easy to reproduce: just try to raster_join three rasters. The second join raises the error above. The current workaround is to call df.drop('spatial_index_agg') on the already-joined DataFrame before the next join.
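The workaround can be sketched as a small helper. This is only an illustration: `join_all` and its parameters are hypothetical names, not part of rasterframes, and the join itself is passed in as a callable so the sketch makes no assumption about the join API beyond what is shown in this issue.

```python
def join_all(base, others, join, helper_col="spatial_index_agg"):
    """Join `base` with each frame in `others`, dropping the helper column first.

    `join` is any two-argument callable that returns the joined frame.
    `helper_col` is the column named in the error message above.
    """
    result = base
    for other in others:
        # Drop the leftover helper column from the previous join so the
        # next join does not see two columns with the same name.
        # (Spark's DataFrame.drop does nothing if the column is absent.)
        result = join(result.drop(helper_col), other)
    return result
```

With Spark DataFrames, and assuming the method form used in this issue, the call might look like `join_all(rf1, [rf2, rf3], lambda left, right: left.raster_join(right))`, where `rf1`, `rf2`, and `rf3` are placeholder names.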

mjgolebiewski avatar Feb 14 '20 08:02 mjgolebiewski

@mjgolebiewski What would you expect the automatic behavior to be? Do you think random characters should be added? Some other mechanism?

Note: Any columns in the RHS dataframe are going to be propagated to the joined data frame as lists.

metasim avatar Feb 14 '20 18:02 metasim

If not random characters, then maybe some characters related to joined dataframes names? I am still exploring raster_join and its outputs, so I'm not sure.

mjgolebiewski avatar Feb 18 '20 12:02 mjgolebiewski

@mjgolebiewski What do you mean by "joined dataframes names"? If you mean the name of the variables referencing them, then there's no way to get that information from within raster_join. My suspicion is that the behavior is typical Spark behavior, in that you have to take care of renaming columns before joins to keep them unique.

metasim avatar Feb 18 '20 14:02 metasim

From a pandas user perspective and also experience with R data.frame, I would expect either:

  1. All column names have a distinguishing string appended, indicating the side of the join they came from: ('_left', '_right') or ('_x', '_y'). These strings may be an argument to the join method.

  2. Only column names appearing in both DataFrames are disambiguated by appending a suffix in this fashion.
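To make the second option concrete, here is a minimal sketch in plain Python. The function name `disambiguate`, its parameters, and the default suffixes are assumptions for illustration only and are not part of raster_join; it just computes which columns would need renaming on each side.

```python
def disambiguate(left_cols, right_cols, suffixes=("_left", "_right"), keys=()):
    """Return (left_renames, right_renames) for column names present on both sides.

    Columns listed in `keys` (for example, join keys) are left untouched.
    """
    shared = (set(left_cols) & set(right_cols)) - set(keys)
    left_renames = {c: c + suffixes[0] for c in left_cols if c in shared}
    right_renames = {c: c + suffixes[1] for c in right_cols if c in shared}
    return left_renames, right_renames


left_map, right_map = disambiguate(
    ["tile", "spatial_index_agg"],
    ["tile2", "spatial_index_agg"],
)
# left_map  == {"spatial_index_agg": "spatial_index_agg_left"}
# right_map == {"spatial_index_agg": "spatial_index_agg_right"}
```

With Spark DataFrames, each mapping could then be applied before the join with `DataFrame.withColumnRenamed(old, new)`. The first option (suffix every column) would simply skip the `shared` filter.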

vpipkt avatar Feb 18 '20 20:02 vpipkt