Technology Radar

DJL + ONNX Runtime

sdk
Trial

Deep Java Library (DJL) + ONNX Runtime is the production-proven stack for running ML models directly inside the JVM — no Python sidecar, no network hop to a model server, models embedded in your Java service.

Why It's in Trial

For Java teams that train models in Python and need to deploy them in a Java backend, DJL + ONNX is the mature, well-understood path. Netflix uses it for real-time inference at 7ms latency per event. Fintech companies use it for fraud detection in payment pipelines.

It's Trial rather than Adopt because most Java teams doing AI today are calling LLM APIs (Spring AI, LangChain4j), not running embedded models — but for teams with latency requirements or data sovereignty concerns that preclude API calls, this is the right stack.

The "Train in Python, Deploy in Java" Pattern

Python (training)           Java (serving)
─────────────────           ──────────────
PyTorch / scikit-learn  →   Export to ONNX  →  DJL + ONNX Runtime
Hugging Face model      →   .onnx file      →  Embedded in JAR

Why ONNX as the bridge format?

  • Framework-agnostic: export from PyTorch, TensorFlow, scikit-learn, XGBoost, or Hugging Face
  • Quantization: shrink model size ~4x with int8 quantization, 2–3x faster CPU inference
  • Stable: binary format with versioned spec — your .onnx file runs identically across engines and platforms

Setting Up DJL with ONNX Runtime

<dependency>
  <groupId>ai.djl</groupId>
  <artifactId>api</artifactId>
  <version>0.31.0</version>
</dependency>
<dependency>
  <groupId>ai.djl.onnxruntime</groupId>
  <artifactId>onnxruntime-engine</artifactId>
  <version>0.31.0</version>
</dependency>
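With both artifacts on the classpath, it's worth sanity-checking that DJL actually discovered the ONNX Runtime engine before loading any model. A small sketch using DJL's `Engine` registry (the class name `EngineCheck` is just illustrative):

```java
import ai.djl.engine.Engine;

public class EngineCheck {
    public static void main(String[] args) {
        // Lists every engine DJL found on the classpath; "OnnxRuntime"
        // should appear once the onnxruntime-engine dependency is present.
        System.out.println(Engine.getAllEngines());

        // Resolve the ONNX Runtime engine explicitly; throws if it's missing.
        Engine onnx = Engine.getEngine("OnnxRuntime");
        System.out.println(onnx.getEngineName() + " " + onnx.getVersion());
    }
}
```

Failing fast here is cheaper than debugging a model-load error later: if the engine dependency is missing, `Engine.getEngine` throws immediately with a clear message.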

Loading and running a model:

Criteria<String, Classifications> criteria = Criteria.builder()
    .setTypes(String.class, Classifications.class)
    .optModelPath(Paths.get("models/sentiment.onnx"))
    .optTranslator(new SentimentTranslator()) // custom Translator: text → tensors, logits → labels
    .optEngine("OnnxRuntime")
    .build();

try (ZooModel<String, Classifications> model = criteria.loadModel();
     Predictor<String, Classifications> predictor = model.newPredictor()) {

    Classifications result = predictor.predict("This product works great!");
    System.out.println(result.best().getClassName()); // "POSITIVE"
}
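`SentimentTranslator` in the snippet above is code you supply: a `Translator` that turns raw text into input tensors and model output back into `Classifications`. A minimal sketch, assuming a Hugging Face tokenizer (via DJL's separate `ai.djl.huggingface:tokenizers` artifact), a model that takes token ids plus an attention mask, and a NEGATIVE/POSITIVE label order — all properties of your exported model, not guarantees:

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import ai.djl.modality.Classifications;
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.translate.Translator;
import ai.djl.translate.TranslatorContext;
import java.util.Arrays;

public class SentimentTranslator implements Translator<String, Classifications> {

    // Assumed: the tokenizer matching your exported model (here a common
    // sentiment checkpoint id; swap in your own tokenizer files).
    private final HuggingFaceTokenizer tokenizer =
            HuggingFaceTokenizer.newInstance("distilbert-base-uncased-finetuned-sst-2-english");

    @Override
    public NDList processInput(TranslatorContext ctx, String input) {
        // Text → token ids + attention mask, in the order the model expects.
        Encoding encoding = tokenizer.encode(input);
        NDArray ids = ctx.getNDManager().create(encoding.getIds());
        NDArray mask = ctx.getNDManager().create(encoding.getAttentionMask());
        return new NDList(ids, mask);
    }

    @Override
    public Classifications processOutput(TranslatorContext ctx, NDList list) {
        // Assumed: one logit per class; softmax turns them into probabilities
        // for the label list below (order must match the training labels).
        NDArray probabilities = list.singletonOrThrow().softmax(-1);
        return new Classifications(Arrays.asList("NEGATIVE", "POSITIVE"), probabilities);
    }
}
```

The translator is where most integration bugs live — tokenizer mismatch or wrong label order silently produces plausible but wrong predictions, so test it against known Python outputs.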

Production Performance

Netflix's numbers:

  • Character-level CNN + Universal Sentence Encoder inference: 7ms per event
  • Processing: real-time log stream classification
  • Infra: DJL on JVM inside existing Java services — no Python deployment

General benchmarks with ONNX + int8 quantization vs full float32:

  • Model size: ~4x smaller
  • CPU inference speed: 2–3x faster
  • Accuracy loss: typically <1%

DJL's NDManager handles C++ memory management automatically — tested at 100+ hours continuous inference with no memory leaks.
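The mechanism is try-with-resources scoping: tensors are owned by an `NDManager`, and closing the manager frees their native (C++) buffers deterministically rather than waiting on the garbage collector. A minimal illustration:

```java
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;

public class NDManagerDemo {
    public static void main(String[] args) {
        // All arrays created under this manager hold native memory;
        // closing the manager releases every one of them.
        try (NDManager manager = NDManager.newBaseManager()) {
            NDArray logits = manager.create(new float[] {1.2f, -0.3f});
            NDArray probs = logits.softmax(0);
            System.out.println(probs); // native buffers still alive here
        } // manager.close() frees the C++ memory behind logits and probs
    }
}
```

Inside a `Predictor`, DJL scopes a manager around each `predict` call for you, which is why long-running inference doesn't accumulate native memory.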

When to Choose This Over API Calls

Scenario                               Use API (Spring AI)   Use DJL + ONNX
────────                               ───────────────────   ──────────────
General LLM text generation                     ✓
Classification / scoring at <10ms                                   ✓
Data sovereignty (no external calls)                                ✓
Custom fine-tuned model                                             ✓
Cost: millions of inferences/day                                    ✓
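At millions of inferences per day the serving pattern matters: a loaded `ZooModel` is safe to share across threads, but a `Predictor` is stateful and should not be shared. One common shape is a predictor per worker thread; a hedged sketch (`SentimentService` and its API are illustrative, not part of DJL):

```java
import ai.djl.inference.Predictor;
import ai.djl.modality.Classifications;
import ai.djl.repository.zoo.ZooModel;

public class SentimentService implements AutoCloseable {

    private final ZooModel<String, Classifications> model;

    // One Predictor per thread: Predictor holds per-call state and is not
    // thread-safe, while the underlying ZooModel can be shared freely.
    private final ThreadLocal<Predictor<String, Classifications>> predictors;

    public SentimentService(ZooModel<String, Classifications> model) {
        this.model = model;
        this.predictors = ThreadLocal.withInitial(model::newPredictor);
    }

    public String classify(String text) throws Exception {
        return predictors.get().predict(text).best().getClassName();
    }

    @Override
    public void close() {
        // Note: a production service should also close each thread's
        // Predictor when its worker thread retires.
        model.close();
    }
}
```

This keeps the model's weights in memory exactly once while letting request threads run inference concurrently without locking.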

Key Characteristics

Property           Value
────────           ─────
Backed by          Amazon Web Services
Supported engines  ONNX Runtime, PyTorch, TensorFlow, MXNet
Java version       8+
Notable users      Netflix, fintech fraud detection