Question: Alternative to MLeap for Real-Time Inference Without Spark Context with SparkXGBClassifier

We are exploring alternatives to MLeap for running inference without Spark, since MLeap has limitations with Spark/PySpark version compatibility and library updates.

Our Setup & Goal

Environment: PySpark 3.5.5
Algorithm: Distributed ML training using XGBoost with Spark.
Goal: Run real-time inference without requiring a Spark session/context, to reduce overhead and response latency.

What We Did

Took a dataset (Titanic), converted it to Parquet, and split it into 80% (train) and 20% (test).
Trained with Spark (80% data) including preprocessing + XGBoost.
Evaluated on Spark (20% data) and logged the trained model.
Tried multiple logging/serialization approaches:
- MLflow pyfunc
- ONNX
- XGBoost native model
For inference: loaded the same 20% data, applied preprocessing outside Spark, reloaded the trained model, and ran predictions.
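
One thing worth ruling out in the steps above is drift between the Spark preprocessing and the preprocessing re-implemented outside Spark. A minimal sketch, assuming the pipeline contains a categorical indexer and an imputation step (the post does not say which stages were used): export the fitted parameters to JSON once and reuse them in the service instead of recomputing them. The column names, label lists, and fill value below are hypothetical placeholders; in practice they would be read from the fitted stages (for example `StringIndexerModel.labels`).

```python
import json

# Hypothetical fitted parameters. In a real pipeline these would be read from
# the fitted Spark stages (for example StringIndexerModel.labels), not retyped.
fitted_params = {
    "sex_labels": ["male", "female"],    # position in the list = index
    "embarked_labels": ["S", "C", "Q"],
    "age_fill": 29.7,                    # hypothetical value learned on the train split
}

# Persist next to the model so the service loads the exact same values.
serialized = json.dumps(fitted_params)


def build_encoder(labels):
    """Return a function that maps a label to its training-time index."""
    index = {label: float(i) for i, label in enumerate(labels)}

    def encode(value):
        # Fail loudly on unseen categories instead of guessing an index.
        if value not in index:
            raise KeyError(f"unseen category: {value!r}")
        return index[value]

    return encode


def preprocess(row, params):
    """Turn one raw record (a dict) into the feature list the model expects."""
    encode_sex = build_encoder(params["sex_labels"])
    encode_embarked = build_encoder(params["embarked_labels"])
    age = row["Age"] if row.get("Age") is not None else params["age_fill"]
    return [encode_sex(row["Sex"]), encode_embarked(row["Embarked"]), float(age)]


params = json.loads(serialized)
features = preprocess({"Sex": "female", "Embarked": "C", "Age": None}, params)
print(features)
```

If the indices or fill values used outside Spark differ from the fitted ones, the model receives different inputs, and the accuracy gap has nothing to do with model serialization.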

The Problem

In all approaches tested (MLflow pyfunc, ONNX, XGBoost native save/load), accuracy differs between:
- Spark-based evaluation (during training)
- Non-Spark inference (real-time service)
It seems precision is lost when the model is saved and reloaded outside Spark.
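
One possible mechanism for such a gap, offered as a hypothesis rather than a diagnosis: XGBoost stores feature values and split thresholds as 32-bit floats, while Spark ML vectors hold doubles. If one path compares a double feature against a float32 threshold and the other rounds both to float32 first, a row sitting exactly on a split can go down a different branch. The sketch below shows the effect with the standard library only; the value 0.1 is an arbitrary illustration.

```python
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


threshold64 = 0.1                      # split threshold as a double
threshold32 = to_float32(threshold64)  # the same threshold stored as float32
feature64 = 0.1                        # feature value kept as a double

# A tree node sends a row left when feature < threshold.
goes_left_mixed = feature64 < threshold32            # double feature, float32 threshold
goes_left_f32 = to_float32(feature64) < threshold32  # both rounded to float32

print(goes_left_mixed, goes_left_f32)
```

The two comparisons disagree for the same input, so casting features to float32 before scoring on both paths is a cheap thing to test.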

Main Requirement

The accuracy from Spark-based evaluation and non-Spark inference must match.
Need a solution to serialize/deserialize models that works across Spark training and non-Spark inference.
Prefer portable formats (JSON or similar).
Must avoid Spark context overhead at inference for real-time serving.
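
To make the first requirement checkable, it helps to compare the two paths row by row rather than by aggregate accuracy. A minimal sketch, assuming both paths score the same rows in the same order and return the positive-class probability; the function name, tolerance, and sample numbers are illustrative and not taken from the post.

```python
def compare_predictions(spark_probs, service_probs, threshold=0.5, atol=1e-6):
    """Report where two lists of probabilities for the same rows disagree."""
    if len(spark_probs) != len(service_probs):
        raise ValueError("both paths must score the same rows in the same order")

    max_diff = 0.0
    drifted = []   # rows whose probability differs by more than atol
    flipped = []   # rows whose predicted class differs at the given threshold
    for i, (p_spark, p_service) in enumerate(zip(spark_probs, service_probs)):
        diff = abs(p_spark - p_service)
        max_diff = max(max_diff, diff)
        if diff > atol:
            drifted.append(i)
        if (p_spark >= threshold) != (p_service >= threshold):
            flipped.append(i)

    return {"max_abs_diff": max_diff, "drifted_rows": drifted, "flipped_rows": flipped}


# Made-up numbers purely to exercise the function.
spark_probs = [0.91, 0.12, 0.501, 0.30]
service_probs = [0.91, 0.12, 0.499, 0.35]
report = compare_predictions(spark_probs, service_probs)
print(report)
```

Drifted rows with large differences point at preprocessing or input mismatches, while tiny differences that only flip rows near the threshold point at numeric precision.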

Question

👉 Is there any solution or alternative to MLeap for serving models trained with Spark (e.g., XGBoost with PySpark), but performing inference outside of Spark (lightweight, real-time)?

Should support PySpark 3.5.5
Must work with XGBoost distributed training
Should prevent accuracy mismatch between Spark and non-Spark inference
JSON or portable serialization preferred

Any recommendations for frameworks, libraries, or best practices beyond MLeap would be greatly appreciated.

Aug 21 '25 05:08 himadri-bhattacharjee

I have made changes to the MLeap source code so that it works with Spark 4.0.x and Scala 2.13. I can share the JARs with you.

Sep 03 '25 10:09 mtsol

@mtsol Did you update the whole project? I tried and couldn't fix some of the tests.

Oct 02 '25 09:10 mitgard

@mitgard, I wasn't using TensorFlow, so I skipped its JAR. I have the others.

Oct 02 '25 12:10 mtsol