Question: Alternative to MLeap for Real-Time Inference Without Spark Context with SparkXGBClassifier

Open himadri-bhattacharjee opened this issue 6 months ago • 3 comments

Question: Alternative to MLeap for Real-Time Inference Without Spark Context

We are exploring alternatives to MLeap for running inference without Spark, since MLeap has limitations with Spark/PySpark version compatibility and library updates.


Our Setup & Goal

  • Environment: PySpark 3.5.5
  • Algorithm: Distributed ML training using XGBoost with Spark.
  • Goal: Run real-time inference without requiring a Spark session/context, to reduce overhead and response latency.

What We Did

  1. Took a dataset (Titanic), converted it to Parquet, and split it into 80% (train) and 20% (test).
  2. Trained with Spark (80% data) including preprocessing + XGBoost.
  3. Evaluated on Spark (20% data) and logged the trained model.
  4. Tried multiple logging/serialization approaches:
    • MLflow pyfunc
    • ONNX
    • XGBoost native model
  5. For inference: loaded the same 20% data, applied preprocessing outside Spark, reloaded the trained model, and ran predictions.
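One thing worth ruling out in step 5 is the preprocessing itself. The sketch below assumes, hypothetically, that the Spark pipeline used StringIndexer with its default settings (the issue does not say which transformers were used). By default StringIndexer assigns indices by descending label frequency and breaks ties alphabetically, whereas many non-Spark encoders order labels alphabetically. The function names and sample values here are illustrative only.

```python
from collections import Counter

def spark_style_string_index(values):
    """Mimic StringIndexer's default ordering (frequencyDesc, ties alphabetical)."""
    counts = Counter(v for v in values if v is not None)
    # Most frequent label gets index 0; equal counts fall back to alphabetical order.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def alphabetical_index(values):
    """Typical non-Spark encoding: labels sorted alphabetically."""
    ordered = sorted({v for v in values if v is not None})
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample of a categorical column such as Titanic's "Embarked".
embarked = ["S", "C", "S", "Q", "S", "C"]

spark_mapping = spark_style_string_index(embarked)
local_mapping = alphabetical_index(embarked)

print(spark_mapping)  # {'S': 0.0, 'C': 1.0, 'Q': 2.0}
print(local_mapping)  # {'C': 0.0, 'Q': 1.0, 'S': 2.0}
```

If the two mappings differ like this, the model receives different feature values outside Spark even though the serialized trees are identical. Exporting the fitted mapping from the Spark pipeline and reusing it in the service avoids refitting the encoder.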

The Problem

  • In all approaches tested (MLflow pyfunc, ONNX, XGBoost native save/load), accuracy differs between:
    • Spark-based evaluation (during training)
    • Non-Spark inference (real-time service)
  • It seems precision is lost when the model is saved and reloaded outside Spark.
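One possible (unconfirmed) contributor to a mismatch like this is numeric width. XGBoost stores feature values and split thresholds as 32-bit floats, so a code path that compares a 64-bit feature value against a 32-bit threshold can take a different branch than XGBoost itself would. The sketch below uses only the standard library to show the effect on a single split; the helper names are hypothetical and this is not a diagnosis of the setup described above.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def goes_left(feature_value: float, threshold: float, cast_feature: bool) -> bool:
    """Tree split rule 'feature < threshold', optionally casting the feature to float32 first."""
    value = to_float32(feature_value) if cast_feature else feature_value
    return value < threshold

# A threshold as it would be stored in a 32-bit model file.
threshold = to_float32(0.1)

# Same raw feature value, evaluated with and without the float32 cast.
left_with_float64_feature = goes_left(0.1, threshold, cast_feature=False)
left_with_float32_feature = goes_left(0.1, threshold, cast_feature=True)

print(left_with_float64_feature, left_with_float32_feature)  # True False
```

Casting the preprocessed features to 32-bit floats before prediction, on both the Spark and non-Spark sides, removes this particular source of divergence.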

Main Requirement

  • The accuracy from Spark-based evaluation and non-Spark inference must match.
  • Need a solution to serialize/deserialize models that works across Spark training and non-Spark inference.
  • Prefer portable formats (JSON or similar).
  • Must avoid Spark context overhead at inference for real-time serving.
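To check the first requirement at row level rather than through a single accuracy number, the Spark predictions and the non-Spark predictions for the same test rows can be compared directly. The following is a minimal sketch, assuming both sides export a probability per row in the same row order; the function name, tolerance, and sample values are illustrative.

```python
def find_mismatches(spark_probs, local_probs, tol=1e-6):
    """Return (row_index, spark_prob, local_prob) for rows that differ by more than tol."""
    if len(spark_probs) != len(local_probs):
        raise ValueError("prediction lists must cover the same rows")
    return [
        (i, s, l)
        for i, (s, l) in enumerate(zip(spark_probs, local_probs))
        if abs(s - l) > tol
    ]

def label_flips(mismatches, cutoff=0.5):
    """Keep only the mismatched rows whose predicted class changes at the cutoff."""
    return [row for row in mismatches if (row[1] >= cutoff) != (row[2] >= cutoff)]

# Hypothetical probabilities for three test rows.
spark_probs = [0.91, 0.12, 0.501]
local_probs = [0.91, 0.12, 0.499]

mismatches = find_mismatches(spark_probs, local_probs)
flips = label_flips(mismatches)

print(mismatches)  # [(2, 0.501, 0.499)]
print(flips)       # [(2, 0.501, 0.499)]
```

Inspecting the feature values of the mismatched rows usually shows whether the difference comes from preprocessing or from the model itself.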

Question

👉 Is there any solution or alternative to MLeap for serving models trained with Spark (e.g., XGBoost with PySpark), but performing inference outside of Spark (lightweight, real-time)?

  • Should support PySpark 3.5.5
  • Must work with XGBoost distributed training
  • Should prevent accuracy mismatch between Spark and non-Spark inference
  • JSON or portable serialization preferred

Any recommendations for frameworks, libraries, or best practices beyond MLeap would be greatly appreciated.

himadri-bhattacharjee avatar Aug 21 '25 05:08 himadri-bhattacharjee

I have modified the MLeap source code to work with Spark 4.0.x and Scala 2.13. I can share the JARs with you.

mtsol avatar Sep 03 '25 10:09 mtsol

@mtsol Did you update the whole project? I tried, but I can't fix some of the tests.

mitgard avatar Oct 02 '25 09:10 mitgard

@mitgard, I wasn't using TensorFlow, so I skipped its JAR. I have the others.

mtsol avatar Oct 02 '25 12:10 mtsol