The sedona-spark-shaded package should exclude some duplicate dependencies.
pom.xml should exclude any dependencies that already ship in Spark's jars. For example, edu.ucar:cdm-core should exclude guava, httpclient, and protobuf-java:
```xml
... ...
<dependency>
  <groupId>edu.ucar</groupId>
  <artifactId>cdm-core</artifactId>
  <exclusions>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.protobuf</groupId>
      <artifactId>protobuf-java</artifactId>
    </exclusion>
  </exclusions>
</dependency>
... ...
<artifactSet>
  <excludes>
    <exclude>org.scala-lang:scala-library</exclude>
    <exclude>org.apache.commons:commons-*</exclude>
    <exclude>commons-pool:commons-pool</exclude>
    <exclude>commons-lang:commons-lang</exclude>
    <exclude>commons-io:commons-io</exclude>
    <exclude>commons-logging:commons-logging</exclude>
  </excludes>
</artifactSet>
... ...
```
@freamdx would you mind creating a PR to fix this yourself?
A few notes:
- `cdm-core` has to be bundled in the shaded package, so we include it in the shaded pom.xml with `compile` scope rather than `provided`.
- `guava` is better shaded than excluded, since we don't know whether the guava jar shipped with Spark is compatible with Sedona, and shading guava is a common practice (see the sketch after this list).
- The other package exclusions seem to be OK. Apache Commons has good backward compatibility, and the versions shipped with Spark are later than what we bundle.
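For reference, shading guava with the maven-shade-plugin usually means adding a `<relocation>` rule so the bundled guava classes move into a project-private namespace. Below is a minimal sketch; the `org.apache.sedona.shaded` prefix is an illustrative assumption, not necessarily the pattern spark-shaded/pom.xml actually uses:

```xml
<!-- Minimal sketch: relocate the bundled guava classes so they cannot clash
     with whatever guava version Spark ships. The shadedPattern prefix is an
     illustrative assumption. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.apache.sedona.shaded.com.google.common</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

With a rule like this, the shade plugin rewrites bytecode references in all shaded classes (including cdm-core's), so everything inside the fat jar resolves to the relocated copy regardless of which guava Spark puts on the classpath.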
I agree with @Kontinuation. Guava is pretty tricky, as I learned when upgrading Spark.
I'm having trouble understanding why this Google library has such poor compatibility. It seems unexpected from a leading tech company.
I'm willing to step in if @jiayuasu approves, given @freamdx's lack of response. However, I'm somewhat worried about proceeding without rigorous testing to mitigate potential risks.
@zwu-net please feel free to create a PR to fix this
@jiayuasu (I would also like to bring in @james-willis here, since I got to know him through the DBSCAN discussion. If you are busy with other tasks, you don't need to participate.) @Kontinuation @freamdx
Research Findings and Proposal
After researching the feasibility of modifying Sedona's pom.xml (https://github.com/apache/sedona/blob/master/spark-shaded/pom.xml), I concluded that manually managing library inclusions and exclusions across various Spark and Sedona versions would be overly complex and labor-intensive.
Proposed Solution
To address this challenge, I suggest creating a custom tool to manage pom.xml generation. This approach is inspired by my previous work on a custom transformer for converting SQL from Oracle to Snowflake (https://www.linkedin.com/pulse/revolutionizing-data-analysis-custom-transformer-seamless-paul-wu-4euxc/?trackingId=1AyXqiJh1sei4yybC%2FZW9g%3D%3D).
Tool Requirements
The tool should:
- Fetch Spark releases (ideally from a prefetched repository to minimize build time and network stability issues)
- Generate pom.xml files (see the illustrative fragment after this list)
- Be implemented in Java or Python (with careful dependency management if using Python)
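To make the proposal concrete, the fragment below sketches the kind of output such a tool might emit for one Spark release. The entries shown are assumptions taken from the exclusion list earlier in this thread, not a verified inventory for any specific Spark version:

```xml
<!-- Illustrative sketch only: an artifactSet the tool might generate for one
     Spark release. Each <exclude> would correspond to an artifact detected
     in that release's jars/ directory. -->
<artifactSet>
  <excludes>
    <exclude>org.scala-lang:scala-library</exclude>
    <exclude>org.apache.commons:commons-*</exclude>
    <exclude>commons-io:commons-io</exclude>
    <!-- ... one entry per artifact found in $SPARK_HOME/jars ... -->
  </excludes>
</artifactSet>
```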
Implementation Considerations
Given the scope of this task, it may take several months for a part-time contributor like me to implement and test it. Before proceeding, I'd appreciate feedback on:
- The viability of this approach
- Potential alternative solutions
Please share your thoughts.
- Only NetCDF uses the ucar package.
- The s2-geometry package depends on guava:jar:25.1-jre by default (per `mvn dependency:tree`).
- `mvn test` passed.
> @freamdx would you mind creating a PR to fix this yourself?
How do I get a ticket ID?
@freamdx Please sign up for an account here: https://issues.apache.org/jira/projects/SEDONA