Unable to save model
I am using Spark 3.1 and Scala 2.12, with the isolation-forest artifact below from Maven:
<groupId>com.linkedin.isolation-forest</groupId>
<artifactId>isolation-forest_3.0.0_2.12</artifactId>
Recently I started getting the error below:
java.lang.NoClassDefFoundError: org/json4s/JsonAssoc$
  at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelWriter.saveImpl(IsolationForestModelReadWrite.scala:239)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:168)
Below is our code:

import com.linkedin.relevance.isolationforest.IsolationForest
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

def generateAnomalyScoreUsingIsolationForest(spark: SparkSession, year: String, month: String, day: String): Unit = {
  spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")

  val model_path = f"/iforest_$year%s_$month%s_$day%s.model"
  val data_path = f"/anomalyScores_$year%s_$month%s_$day%s.parquet/"

  val df_final_table = spark.sql("select * from AppFeatures_v2")
  val cols = df_final_table.columns
  val labelCol = cols.head // first column is the label
  val assembler = new VectorAssembler()
    .setInputCols(cols.slice(1, cols.length)) // remaining columns are features
    .setOutputCol("features")
  val data = assembler.transform(df_final_table).select(col("features"), col(labelCol).as("label"))

  val contamination = 0.002
  val max_samples = 0.3
  val max_features = 0.4
  val num_estimator = 1000

  val isolationForest = new IsolationForest()
    .setNumEstimators(num_estimator)
    .setBootstrap(false)
    .setMaxSamples(max_samples)
    .setMaxFeatures(max_features)
    .setFeaturesCol("features")
    .setPredictionCol("predictedLabel")
    .setScoreCol("outlierScore")
    .setContamination(contamination)
    .setContaminationError(0.01 * contamination)
    .setRandomSeed(21)

  val isolationForestModel = isolationForest.fit(data)
  val dataWithScores = isolationForestModel.transform(data)

  // Failing on the line below
  isolationForestModel.write.overwrite().save("/iforest_latest.model")
  isolationForestModel.write.overwrite().save(model_path)

  dataWithScores.select("label", "predictedLabel", "outlierScore")
    .write.mode("overwrite").option("overwriteSchema", "true").parquet(data_path)
}
It was working until a couple of weeks ago. Can anyone help me solve this problem?
You mentioned that it was working until several weeks ago. Has anything changed on your side (e.g., Spark / Scala versions used on your cluster)?
There are several JSON-related model I/O issues that were reported and resolved in prior tickets. I'd suggest taking a look at those to see if any are relevant.
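A NoClassDefFoundError for a json4s class at save time often indicates a json4s binary mismatch: Spark 3.1 bundles a different json4s version than Spark 3.0, so an artifact built against Spark 3.0 can break when the cluster is upgraded. As a quick check, a hypothetical helper like the one below (not part of Spark or isolation-forest) can show which jar actually provides a given class on your driver:

```scala
import scala.util.Try

// Hypothetical diagnostic helper: report which jar (if any) provides a class
// at runtime. Returns None if the class is missing, or if it is loaded from
// the JVM's bootstrap classpath (which has no code source).
def jarOf(className: String): Option[String] =
  Try(Class.forName(className)).toOption
    .flatMap(c => Option(c.getProtectionDomain.getCodeSource))
    .map(_.getLocation.toString)
```

On the failing cluster, checking e.g. `jarOf("org.json4s.JsonDSL")` would reveal which json4s jar Spark is actually loading, which you can compare against the version your artifact was compiled for.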
There are isolation-forest artifacts built for Spark 3.1.1 and Scala 2.12 (Maven Central). I'd suggest using a version that matches your environment.
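For reference, switching to the artifact built for your Spark version would look something like this in the POM (the version number here is illustrative; check Maven Central for current releases):

```xml
<dependency>
  <groupId>com.linkedin.isolation-forest</groupId>
  <!-- the artifactId encodes the Spark and Scala versions it was built for -->
  <artifactId>isolation-forest_3.1.1_2.12</artifactId>
  <!-- illustrative version; pick the latest release from Maven Central -->
  <version>2.0.8</version>
</dependency>
```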
Closing this as there have been no replies for several months.