Technology Radar

Virtual Threads for AI (Project Loom)

streaming
Adopt

Java 21's virtual threads (Project Loom) are the right concurrency model for AI orchestration code. They replace the need for reactive programming when making multiple concurrent LLM API calls, and can be enabled with a single property in Spring Boot 3.2+ and Quarkus 3.x.

Why It's in Adopt

LLM API calls are I/O-bound and slow (hundreds of milliseconds to several seconds). Before Java 21, concurrent fan-out code either used reactive frameworks (WebFlux, RxJava) — complex, hard to debug — or created fixed thread pools that limited parallelism.

Virtual threads eliminate this trade-off. You write simple blocking code; the JVM suspends the virtual thread when it blocks on I/O and runs something else on the carrier thread, transparently. You get reactive-level throughput with synchronous-style code.

Enabling Virtual Threads

Spring Boot 3.2+: One property:

spring.threads.virtual.enabled=true

That's it. @Async methods, embedded-server (Tomcat/Jetty) request handling, and @Scheduled tasks all run on virtual threads.

Quarkus 3.x:

quarkus.virtual-threads.enabled=true

Annotate blocking methods with @RunOnVirtualThread.

Plain Java:

var executor = Executors.newVirtualThreadPerTaskExecutor();
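
A minimal, runnable sketch of the plain-Java form. The LLM clients from the surrounding text are simulated here with Thread.sleep; the class and method names are illustrative, not a real API:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;

public class VirtualThreadFanOut {

    // Each task gets its own virtual thread; try-with-resources close()
    // waits for all submitted tasks before returning.
    static List<String> fanOut(List<Callable<String>> tasks) throws Exception {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            return executor.invokeAll(tasks).stream()
                    .map(future -> {
                        try {
                            return future.get(); // already completed after invokeAll
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    })
                    .toList();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for slow, I/O-bound LLM calls
        var results = fanOut(List.of(
                () -> { Thread.sleep(100); return "claude-result"; },
                () -> { Thread.sleep(100); return "gpt-result"; }
        ));
        System.out.println(results); // [claude-result, gpt-result]
    }
}
```

invokeAll preserves submission order, so results line up with the input list regardless of which call finishes first.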

The AI Fan-Out Pattern

The most common AI use case: call multiple models/services in parallel and aggregate results.

Before virtual threads (pool-based, complex):

ExecutorService pool = Executors.newFixedThreadPool(10);
// At most 10 calls in flight; long blocking calls starve the pool
Future<String> f1 = pool.submit(() -> llm1.generate(prompt));
Future<String> f2 = pool.submit(() -> llm2.generate(prompt));

With virtual threads (simple, unlimited concurrency):

try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    var subtask1 = scope.fork(() -> claude.generate(prompt));
    var subtask2 = scope.fork(() -> gpt.generate(prompt));
    var subtask3 = scope.fork(() -> gemini.generate(prompt));

    scope.join().throwIfFailed();

    return Results.of(subtask1.get(), subtask2.get(), subtask3.get());
}

Three concurrent LLM calls, fail-fast error handling, and clean resource management in under ten lines, replacing what previously needed a reactive pipeline.

Performance Reality

  • Throughput: Virtual threads allow millions of concurrent I/O-bound tasks — the bottleneck becomes your LLM rate limits, not your thread pool
  • Latency: No change to per-call latency (that's determined by the model)
  • Memory: Stacks stored on heap, ~few KB per virtual thread vs. ~1 MB per platform thread
  • CPU: Negligible overhead — the JVM scheduler is efficient

For a typical AI service handling 1,000 concurrent requests, each making 3 LLM API calls: old model needed 3,000 platform threads (3 GB stack memory); virtual threads need ~20 carrier threads.
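
The scale claim is easy to check yourself. This sketch (assumed numbers, with Thread.sleep standing in for an LLM call) launches 10,000 virtual threads; a platform-thread pool of that size would need roughly 10 GB of stack memory:

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadScale {

    // Launch `count` concurrent I/O-bound tasks, return how many completed.
    static int runMany(int count) {
        var completed = new AtomicInteger();
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < count; i++) {
                executor.submit(() -> {
                    Thread.sleep(Duration.ofMillis(50)); // simulated I/O wait
                    completed.incrementAndGet();
                    return null;
                });
            }
        } // close() blocks until every task finishes
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runMany(10_000)); // 10000
    }
}
```

All 10,000 tasks overlap their sleeps on a handful of carrier threads, so the whole run takes on the order of the 50 ms sleep, not 10,000 × 50 ms.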

When Reactive Is Still the Right Call

Virtual threads replace reactive programming for most AI orchestration, but not all:

  • Backpressure-heavy pipelines. If your system ingests a stream of events faster than your LLM can process them (e.g., processing a Kafka topic of user messages through an AI pipeline), reactive's built-in backpressure (Reactor's onBackpressureDrop, limitRate) is purpose-built for this. Virtual threads give you unlimited concurrency, which is the opposite of what you want when you need to shed load gracefully.
  • SSE/WebSocket streaming with complex operators. If you're composing multiple reactive streams — merging, zipping, or windowing token streams from different models — Reactor's operator library is far richer than anything you'd hand-build with virtual threads. Virtual threads handle "wait for I/O" well; they don't help with "combine three async streams into one."
  • Existing reactive codebases. If you already have a mature WebFlux application with hundreds of reactive endpoints, migrating to virtual threads is a rewrite, not a toggle. The spring.threads.virtual.enabled=true property only helps for new blocking code — it doesn't retroactively simplify your existing Mono<> chains.

The rule of thumb: If your code is "call API, wait, use result" (which most AI orchestration is), use virtual threads. If your code is "merge streams, apply backpressure, window events," keep reactive.

Caveats

Pinned carrier threads: synchronized blocks and some native methods can prevent the carrier thread from being released while a virtual thread blocks. Use ReentrantLock instead of synchronized in hot paths. Spring and most modern libraries have been updated to avoid this, but third-party JDBC drivers or legacy code may still pin. JEP 491 (delivered in JDK 24) removes pinning for synchronized blocks, though pinning inside native calls can still occur.
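
A sketch of the substitution, using a hypothetical prompt cache; the point is the lock structure, not the caching logic:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

public class PromptCache {

    // ReentrantLock lets the JVM unmount the virtual thread while it waits,
    // freeing the carrier thread for other work.
    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, String> cache = new HashMap<>();

    // A synchronized method here could pin the carrier thread on older JDKs
    // if the completion call blocked inside the critical section.
    public String getOrCompute(String prompt, Function<String, String> completion) {
        lock.lock();
        try {
            return cache.computeIfAbsent(prompt, completion);
        } finally {
            lock.unlock();
        }
    }
}
```

The lock()/try/finally/unlock() shape is the idiomatic mechanical replacement for a synchronized block.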

Structured concurrency is still in preview. The StructuredTaskScope API shown above is a preview feature as of JDK 24 (JEP 499, its fourth preview) and requires --enable-preview. The API surface may change. For production code today, use Executors.newVirtualThreadPerTaskExecutor() with try-with-resources, which is final and stable.
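
A preview-free version of the fan-out, sketched with CompletableFuture over the stable virtual-thread executor (the Supplier list stands in for the claude/gpt/gemini clients above):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

public class StableFanOut {

    // Fan out on virtual threads without StructuredTaskScope.
    // allOf(...).join() completes exceptionally if any call failed.
    static List<String> fanOut(List<Supplier<String>> calls) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            var futures = calls.stream()
                    .map(call -> CompletableFuture.supplyAsync(call, executor))
                    .toList();
            CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();
            return futures.stream().map(CompletableFuture::join).toList();
        }
    }

    public static void main(String[] args) {
        System.out.println(fanOut(List.of(() -> "a", () -> "b"))); // [a, b]
    }
}
```

The trade-off versus ShutdownOnFailure: this version waits for every call even when one fails early, so you give up fail-fast cancellation in exchange for a stable, non-preview API.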

Key Characteristics

Property               Value
Available since        Java 21 (LTS)
Spring Boot support    3.2+ (one property)
Quarkus support        3.x (one property + annotation)
Replaces               Reactive programming for I/O-bound concurrency