DJL + ONNX Runtime
Deep Java Library (DJL) plus ONNX Runtime is the production-proven stack for running ML models directly inside the JVM: no Python sidecar, no network hop to a model server, and the model embedded in your Java service.
Why It's in Trial
For Java teams that train models in Python and need to deploy them in a Java backend, DJL + ONNX is the mature, well-understood path. Netflix uses it for real-time inference at 7ms latency per event. Fintech companies use it for fraud detection in payment pipelines.
It's Trial rather than Adopt because most Java teams doing AI today are calling LLM APIs (Spring AI, LangChain4j), not running embedded models — but for teams with latency requirements or data sovereignty concerns that preclude API calls, this is the right stack.
The "Train in Python, Deploy in Java" Pattern
Python (training)                           Java (serving)
─────────────────                           ──────────────
PyTorch / scikit-learn → Export to ONNX  →  DJL + ONNX Runtime
Hugging Face model     → .onnx file      →  Embedded in JAR
Why ONNX as the bridge format?
- Framework-agnostic: export from PyTorch, TensorFlow, scikit-learn, XGBoost, or Hugging Face
- Quantization: int8 quantization shrinks model size ~4x and makes CPU inference 2–3x faster
- Stable: binary format with versioned spec — your .onnx file runs identically across engines and platforms
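That portability can be sanity-checked directly: the exported file opens with the plain ONNX Runtime Java API (`com.microsoft.onnxruntime:onnxruntime`), no DJL involved, which lists the graph's declared inputs and outputs. A sketch, assuming the model file sits at `models/sentiment.onnx`:

```java
import java.nio.file.Paths;

import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

public class OnnxInspect {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // Opening a session validates the file against the ONNX spec
        // and exposes the graph's input/output signatures.
        try (OrtSession session = env.createSession(
                Paths.get("models/sentiment.onnx").toString(),
                new OrtSession.SessionOptions())) {
            System.out.println("inputs:  " + session.getInputNames());
            System.out.println("outputs: " + session.getOutputNames());
        }
    }
}
```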
Setting Up DJL with ONNX Runtime
<dependency>
    <groupId>ai.djl</groupId>
    <artifactId>api</artifactId>
    <version>0.31.0</version>
</dependency>
<dependency>
    <groupId>ai.djl.onnxruntime</groupId>
    <artifactId>onnxruntime-engine</artifactId>
    <version>0.31.0</version>
</dependency>
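For Gradle builds, the equivalent coordinates can be expressed via DJL's published BOM, which keeps module versions aligned (a sketch, assuming the Groovy DSL):

```groovy
dependencies {
    implementation platform("ai.djl:bom:0.31.0") // aligns all DJL module versions
    implementation "ai.djl:api"
    runtimeOnly "ai.djl.onnxruntime:onnxruntime-engine"
}
```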
Loading and running a model:
import java.nio.file.Paths;

import ai.djl.inference.Predictor;
import ai.djl.modality.Classifications;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

// SentimentTranslator is a custom Translator<String, Classifications>
// that tokenizes the input and maps model outputs back to labels.
Criteria<String, Classifications> criteria = Criteria.builder()
        .setTypes(String.class, Classifications.class)
        .optModelPath(Paths.get("models/sentiment.onnx"))
        .optTranslator(new SentimentTranslator())
        .optEngine("OnnxRuntime") // pin the ONNX Runtime engine explicitly
        .build();

try (ZooModel<String, Classifications> model = criteria.loadModel();
     Predictor<String, Classifications> predictor = model.newPredictor()) {
    Classifications result = predictor.predict("This product works great!");
    System.out.println(result.best().getClassName()); // "POSITIVE"
}
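The example above passes a `SentimentTranslator`, the piece DJL uses to convert between Java types and tensors. A minimal sketch of what such a translator might look like; the toy hash tokenizer and the label set here are purely illustrative, since a real translator must replicate the exact preprocessing used when the model was trained:

```java
import java.util.Arrays;
import java.util.List;

import ai.djl.modality.Classifications;
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.translate.Translator;
import ai.djl.translate.TranslatorContext;

public class SentimentTranslator implements Translator<String, Classifications> {
    private static final List<String> LABELS = Arrays.asList("NEGATIVE", "POSITIVE");
    private static final int MAX_LEN = 128;

    @Override
    public NDList processInput(TranslatorContext ctx, String input) {
        // Assumption: the exported model takes a fixed-length int64
        // token-id vector of shape (1, MAX_LEN).
        long[] ids = tokenize(input, MAX_LEN);
        NDArray array = ctx.getNDManager().create(ids).reshape(1, MAX_LEN);
        return new NDList(array);
    }

    @Override
    public Classifications processOutput(TranslatorContext ctx, NDList list) {
        // Assumption: the model emits one logit per class;
        // softmax turns them into probabilities.
        NDArray probs = list.singletonOrThrow().softmax(-1);
        return new Classifications(LABELS, probs);
    }

    // Toy hash-based tokenizer, for illustration only.
    static long[] tokenize(String text, int maxLen) {
        long[] ids = new long[maxLen];
        String[] words = text.toLowerCase().split("\\s+");
        for (int i = 0; i < Math.min(words.length, maxLen); i++) {
            ids[i] = Math.floorMod(words[i].hashCode(), 30_000);
        }
        return ids;
    }
}
```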
Production Performance
Netflix's numbers:
- Character-level CNN + Universal Sentence Encoder inference: 7ms per event
- Processing: real-time log stream classification
- Infra: DJL on JVM inside existing Java services — no Python deployment
General benchmarks with ONNX + int8 quantization vs full float32:
- Model size: ~4x smaller
- CPU inference speed: 2–3x faster
- Accuracy loss: typically <1%
DJL's NDManager automatically manages the native (C++) memory behind tensors; it has been tested at 100+ hours of continuous inference with no memory leaks.
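When tensors are created outside a `Predictor` (which scopes memory per request on its own), the same guarantee comes from closing the owning manager. A minimal sketch, assuming any DJL engine is on the classpath:

```java
import java.util.Arrays;

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;

public class NDManagerScope {
    public static void main(String[] args) {
        // Every NDArray is owned by a manager; closing the manager
        // frees the native buffers of everything it created.
        try (NDManager manager = NDManager.newBaseManager()) {
            NDArray a = manager.create(new float[] {1f, 2f, 3f});
            NDArray doubled = a.mul(2);
            System.out.println(Arrays.toString(doubled.toFloatArray()));
        } // a and doubled are released here; no explicit free needed
    }
}
```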
When to Choose This Over API Calls
| Scenario | Use API (Spring AI) | Use DJL + ONNX |
|---|---|---|
| General LLM text generation | ✓ | — |
| Classification / scoring at <10ms | — | ✓ |
| Data sovereignty (no external calls) | — | ✓ |
| Custom fine-tuned model | — | ✓ |
| Cost: millions of inferences/day | — | ✓ |
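The cost row is worth making concrete. A back-of-envelope sketch, in which every number is a HYPOTHETICAL placeholder to be replaced with your own volumes and prices, shows why per-call API pricing loses to a fixed fleet cost at high volume:

```java
public class CostSketch {
    public static void main(String[] args) {
        // All figures below are hypothetical, purely to show the
        // shape of the comparison; substitute your real numbers.
        long callsPerDay = 5_000_000L;   // assumed daily inference volume
        double apiPricePer1k = 0.50;     // assumed hosted-API price per 1K calls
        double fleetCostPerDay = 150.0;  // assumed cost of CPU nodes running DJL

        double apiCostPerDay = callsPerDay / 1_000.0 * apiPricePer1k;
        System.out.printf("API: $%.0f/day vs embedded: $%.0f/day%n",
                apiCostPerDay, fleetCostPerDay);
    }
}
```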
Key Characteristics
| Property | Value |
|---|---|
| Backed by | Amazon Web Services |
| Supported engines | ONNX Runtime, PyTorch, TensorFlow, MXNet |
| Java version | 8+ |
| Notable users | Netflix, fintech fraud detection |