Technology Radar

DJL + ONNX Runtime

sdk
Trial

Deep Java Library (DJL) + ONNX Runtime is the production-proven stack for running ML models directly inside the JVM — no Python sidecar, no network hop to a model server, models embedded in your Java service.

Why It's in Trial

For Java teams that train models in Python and need to deploy them in a Java backend, DJL + ONNX is the mature, well-understood path. Netflix uses it for real-time inference at 7ms latency per event. Fintech companies use it for fraud detection in payment pipelines.

It's Trial rather than Adopt because most Java teams doing AI today are calling LLM APIs (Spring AI, LangChain4j), not running embedded models — but for teams with latency requirements or data sovereignty concerns that preclude API calls, this is the right stack.

The "Train in Python, Deploy in Java" Pattern

Python (training)           Java (serving)
─────────────────           ──────────────
PyTorch / scikit-learn  →   Export to ONNX  →  DJL + ONNX Runtime
Hugging Face model      →   .onnx file      →  Embedded in JAR

Why ONNX as the bridge format?

  • Framework-agnostic: export from PyTorch, TensorFlow, scikit-learn, XGBoost, or Hugging Face
  • Quantization: shrink model size ~4x with int8 quantization, 2–3x faster CPU inference
  • Stable: binary format with versioned spec — your .onnx file runs identically across engines and platforms

Setting Up DJL with ONNX Runtime

<dependency>
  <groupId>ai.djl</groupId>
  <artifactId>api</artifactId>
  <version>0.31.0</version>
</dependency>
<dependency>
  <groupId>ai.djl.onnxruntime</groupId>
  <artifactId>onnxruntime-engine</artifactId>
  <version>0.31.0</version>
</dependency>
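With both artifacts on the classpath, it's worth sanity-checking that DJL actually discovered the ONNX Runtime engine before loading any model. A small sketch using DJL's `Engine` registry (the class name `EngineCheck` is just illustrative):

```java
import ai.djl.engine.Engine;

public class EngineCheck {
    public static void main(String[] args) {
        // Lists every engine DJL found on the classpath; "OnnxRuntime"
        // should appear once the onnxruntime-engine dependency is present.
        System.out.println(Engine.getAllEngines());

        // Resolve the ONNX Runtime engine explicitly; throws if it's missing.
        Engine onnx = Engine.getEngine("OnnxRuntime");
        System.out.println(onnx.getEngineName() + " " + onnx.getVersion());
    }
}
```

Failing fast here is cheaper than debugging a model-load error later: if the engine dependency is missing, `Engine.getEngine` throws immediately with a clear message.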

Loading and running a model:

Criteria<String, Classifications> criteria = Criteria.builder()
    .setTypes(String.class, Classifications.class)
    .optModelPath(Paths.get("models/sentiment.onnx"))
    .optTranslator(new SentimentTranslator()) // custom Translator: text → tensors, logits → labels
    .optEngine("OnnxRuntime")
    .build();

try (ZooModel<String, Classifications> model = criteria.loadModel();
     Predictor<String, Classifications> predictor = model.newPredictor()) {

    Classifications result = predictor.predict("This product works great!");
    System.out.println(result.best().getClassName()); // "POSITIVE"
}
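`SentimentTranslator` in the snippet above is code you supply: a `Translator` that turns raw text into input tensors and model output back into `Classifications`. A minimal sketch, assuming a Hugging Face tokenizer (via DJL's separate `ai.djl.huggingface:tokenizers` artifact), a model that takes token ids plus an attention mask, and a NEGATIVE/POSITIVE label order — all properties of your exported model, not guarantees:

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import ai.djl.modality.Classifications;
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.translate.Translator;
import ai.djl.translate.TranslatorContext;
import java.util.Arrays;

public class SentimentTranslator implements Translator<String, Classifications> {

    // Assumed: the tokenizer matching your exported model (here a common
    // sentiment checkpoint id; swap in your own tokenizer files).
    private final HuggingFaceTokenizer tokenizer =
            HuggingFaceTokenizer.newInstance("distilbert-base-uncased-finetuned-sst-2-english");

    @Override
    public NDList processInput(TranslatorContext ctx, String input) {
        // Text → token ids + attention mask, in the order the model expects.
        Encoding encoding = tokenizer.encode(input);
        NDArray ids = ctx.getNDManager().create(encoding.getIds());
        NDArray mask = ctx.getNDManager().create(encoding.getAttentionMask());
        return new NDList(ids, mask);
    }

    @Override
    public Classifications processOutput(TranslatorContext ctx, NDList list) {
        // Assumed: one logit per class; softmax turns them into probabilities
        // for the label list below (order must match the training labels).
        NDArray probabilities = list.singletonOrThrow().softmax(-1);
        return new Classifications(Arrays.asList("NEGATIVE", "POSITIVE"), probabilities);
    }
}
```

The translator is where most integration bugs live — tokenizer mismatch or wrong label order silently produces plausible but wrong predictions, so test it against known Python outputs.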

Production Performance

Netflix's numbers:

  • Character-level CNN + Universal Sentence Encoder inference: 7ms per event
  • Processing: real-time log stream classification
  • Infra: DJL on JVM inside existing Java services — no Python deployment

General benchmarks with ONNX + int8 quantization vs full float32:

  • Model size: ~4x smaller
  • CPU inference speed: 2–3x faster
  • Accuracy loss: typically <1%

DJL's NDManager handles C++ memory management automatically — tested at 100+ hours continuous inference with no memory leaks.
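The mechanism is try-with-resources scoping: tensors are owned by an `NDManager`, and closing the manager frees their native (C++) buffers deterministically rather than waiting on the garbage collector. A minimal illustration:

```java
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;

public class NDManagerDemo {
    public static void main(String[] args) {
        // All arrays created under this manager hold native memory;
        // closing the manager releases every one of them.
        try (NDManager manager = NDManager.newBaseManager()) {
            NDArray logits = manager.create(new float[] {1.2f, -0.3f});
            NDArray probs = logits.softmax(0);
            System.out.println(probs); // native buffers still alive here
        } // manager.close() frees the C++ memory behind logits and probs
    }
}
```

Inside a `Predictor`, DJL scopes a manager around each `predict` call for you, which is why long-running inference doesn't accumulate native memory.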

When to Choose This Over API Calls

Scenario                               Use API (Spring AI)   Use DJL + ONNX
────────                               ───────────────────   ──────────────
General LLM text generation                     ✓
Classification / scoring at <10ms                                   ✓
Data sovereignty (no external calls)                                ✓
Custom fine-tuned model                                             ✓
Cost: millions of inferences/day                                    ✓
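At millions of inferences per day the serving pattern matters: a loaded `ZooModel` is safe to share across threads, but a `Predictor` is stateful and should not be shared. One common shape is a predictor per worker thread; a hedged sketch (`SentimentService` and its API are illustrative, not part of DJL):

```java
import ai.djl.inference.Predictor;
import ai.djl.modality.Classifications;
import ai.djl.repository.zoo.ZooModel;

public class SentimentService implements AutoCloseable {

    private final ZooModel<String, Classifications> model;

    // One Predictor per thread: Predictor holds per-call state and is not
    // thread-safe, while the underlying ZooModel can be shared freely.
    private final ThreadLocal<Predictor<String, Classifications>> predictors;

    public SentimentService(ZooModel<String, Classifications> model) {
        this.model = model;
        this.predictors = ThreadLocal.withInitial(model::newPredictor);
    }

    public String classify(String text) throws Exception {
        return predictors.get().predict(text).best().getClassName();
    }

    @Override
    public void close() {
        // Note: a production service should also close each thread's
        // Predictor when its worker thread retires.
        model.close();
    }
}
```

This keeps the model's weights in memory exactly once while letting request threads run inference concurrently without locking.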

Key Characteristics

Property           Value
────────           ─────
Backed by          Amazon Web Services
Supported engines  ONNX Runtime, PyTorch, TensorFlow, MXNet
Java version       8+
Notable users      Netflix, fintech fraud detection