AI Is Fast. Infrastructure Is Not: The Latency Problem in Real-Time AI

You can watch this talk on YouTube in Russian and in English (auto-dubbed).
The presentation is available here.

Amir Adigamov’s talk focused on one of the most visible gaps in modern AI products: the models are getting faster and more powerful, but many real-time AI experiences still feel slow to the user. His presentation unpacked that contradiction and showed that the real issue often lies not in the model itself, but in the surrounding infrastructure stack.

He began by defining what he meant by real-time AI: systems such as voice interfaces, AI-driven interactions in games, extended reality scenarios, or any other user-facing experience where a person expects an immediate response. In those environments, even modest latency creates friction. The talk argued that while AI engines are improving at an extraordinary pace, the rest of the system often remains constrained by older architectural assumptions and slower data-handling layers.

A major theme of the session was that AI inference is only one stage in a much larger pipeline. Between the user request and the first useful answer, a system may incur network latency, queueing delays, business logic overhead, retrieval and ranking steps, serialization costs, vector database search time, and token generation time. Amir broke down this “anatomy of latency” in detail, explaining how even when the inference engine is fast, the user may still experience substantial delay because of all the surrounding layers. This was especially relevant in retrieval-augmented systems, where pre-processing and retrieval often consume a surprising portion of the response budget before generation even begins.
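To make the "anatomy of latency" concrete, here is a minimal sketch of a latency budget for a single retrieval-augmented request. The stage names follow the breakdown above, but every number is invented for illustration and does not come from the talk. The sketch also assumes the stages run strictly one after another, which a real system may not do.

```python
# Hypothetical per-stage latencies in milliseconds for one RAG request.
# Stage names follow the breakdown in the text; the numbers are invented.
STAGES_MS = {
    "network": 40,
    "queueing": 15,
    "business_logic": 20,
    "retrieval_and_ranking": 120,
    "serialization": 10,
    "vector_search": 60,
    "first_token_generation": 80,
}


def time_to_first_token(stages):
    """Total delay before the first token, assuming strictly sequential stages."""
    return sum(stages.values())


def share_outside_model(stages, model_stage="first_token_generation"):
    """Fraction of the first-token delay spent outside the model itself."""
    total = time_to_first_token(stages)
    return (total - stages[model_stage]) / total


if __name__ == "__main__":
    total = time_to_first_token(STAGES_MS)
    # List the most expensive stages first.
    for name, ms in sorted(STAGES_MS.items(), key=lambda kv: -kv[1]):
        print(f"{name:<24}{ms:>5} ms  {ms / total:6.1%}")
    print(f"time to first token: {total} ms")
    print(f"outside the model:   {share_outside_model(STAGES_MS):.1%}")
```

With these made-up numbers the first token arrives after 345 ms, and roughly three quarters of that time is spent before the model starts generating.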

He also drew a distinction between time to first token and the rest of the generation process. For many systems, the first-token delay is the most important user experience threshold because it determines whether the interaction feels responsive. Once a system begins streaming output, users are more forgiving. But before that point, every step in the stack matters. The talk made it clear that optimizing real-time AI means optimizing the entire system path, not just the model.
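The difference between time to first token and total generation time is easy to measure on any streaming interface. The sketch below is one possible way to do it. `measure_stream` and `fake_model` are hypothetical helpers written for this example, and `fake_model` only simulates a model with sleeps.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a token stream and return (time_to_first_token, total_time, count).

    Times are in the units of `clock` (seconds by default). For an empty
    stream, time_to_first_token is reported as the total time.
    """
    start = clock()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            first = clock() - start  # delay until the first token arrived
        count += 1
    total = clock() - start
    return (first if first is not None else total), total, count


def fake_model(prefill_s=0.05, per_token_s=0.01, n_tokens=5):
    """Simulated model: one pause before the first token, then a steady pace."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, count = measure_stream(fake_model())
    print(f"first token after {ttft * 1000:.0f} ms, "
          f"{count} tokens in {total * 1000:.0f} ms")
```

The `clock` parameter exists only so that the measurement can be checked with a deterministic clock. In production code the default `time.perf_counter` would normally be enough.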

Another practical and well-received part of the session covered architectural choices that affect latency. Amir contrasted older patterns such as REST and JSON serialization with faster alternatives like gRPC, protobuf-based communication, and more efficient binary formats. He also discussed memory and hardware design considerations, especially around token generation bottlenecks, and showed why throughput and responsiveness can be constrained by memory access patterns as much as by compute.
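The serialization point can be illustrated with the Python standard library alone. The sketch below encodes the same embedding vector once as JSON text and once as packed 32-bit floats. The `struct`-based layout is an assumption made for this example and is not the protobuf wire format. It only shows the general effect of a compact binary encoding; float32 packing also loses some precision compared with the JSON text.

```python
import json
import struct


def encode_json(vector):
    """Text encoding: every float becomes a decimal string."""
    return json.dumps({"embedding": vector}).encode("utf-8")


def encode_binary(vector):
    """Binary encoding: little-endian uint32 length, then float32 values."""
    return struct.pack(f"<I{len(vector)}f", len(vector), *vector)


def decode_binary(payload):
    """Inverse of encode_binary."""
    (n,) = struct.unpack_from("<I", payload, 0)
    return list(struct.unpack_from(f"<{n}f", payload, 4))


if __name__ == "__main__":
    # Hypothetical 768-dimensional embedding with made-up values.
    vector = [i / 1000 for i in range(768)]
    as_json = encode_json(vector)
    as_binary = encode_binary(vector)
    print(f"JSON:   {len(as_json)} bytes")
    print(f"binary: {len(as_binary)} bytes")
```

The binary payload has a fixed size of 4 bytes per value plus a 4-byte header, while the JSON size depends on how many digits each number needs. Payload size is only part of the cost; parsing text into floats also takes CPU time that a packed format largely avoids.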

One of the strongest concrete examples in the talk involved voice AI architecture. He described a more traditional implementation in which the system waits for a user to finish speaking, then performs retrieval and ranking before responding. He contrasted that with a lower-latency architecture using a semantic cache, a fast-path agent for common requests, and a slower background process that prefetches likely relevant information while the user is still speaking. This kind of split architecture can dramatically reduce perceived latency by avoiding unnecessary waits and making the response path more proactive.
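The split architecture can be sketched with `asyncio`. This is a simplified illustration under several assumptions, not the implementation shown in the talk. The embeddings are hand-written toy vectors, `slow_retrieval` only sleeps, the cache hit stands in for the fast-path agent, and all class and function names are hypothetical.

```python
import asyncio
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored answer when a query embedding is close enough."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

    def get(self, embedding):
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None


async def slow_retrieval(topic):
    """Stand-in for retrieval and ranking; a real system would query a store."""
    await asyncio.sleep(0.05)
    return f"documents about {topic}"


async def handle_utterance(cache, partial_topic, final_embedding, speech_remaining_s=0.05):
    # Start prefetching from the partial transcript while the user is still talking.
    prefetch = asyncio.create_task(slow_retrieval(partial_topic))
    await asyncio.sleep(speech_remaining_s)  # the user finishes speaking

    # Fast path: a semantically similar request was answered before.
    cached = cache.get(final_embedding)
    if cached is not None:
        prefetch.cancel()  # the prefetched context is no longer needed
        return "fast path", cached

    # Slow path: the prefetch has been running in parallel with the speech.
    context = await prefetch
    return "slow path", f"answer built from {context}"


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put([1.0, 0.0, 0.0], "It is 3 pm.")
    print(asyncio.run(handle_utterance(cache, "time", [0.99, 0.05, 0.0])))
    print(asyncio.run(handle_utterance(cache, "weather", [0.0, 1.0, 0.0])))
```

The design choice worth noting is that retrieval starts before the utterance ends. On a cache miss, the wait after the user stops speaking is only whatever part of the retrieval has not already completed.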

The practical recommendations at the end of the talk tied these ideas together: keep the path to the model short, avoid unnecessary legacy overhead, keep warm state where it matters, use caching aggressively when possible, and stream responses early instead of waiting for the whole pipeline to complete. The broader message was clear: users do not interact with isolated models; they interact with systems. If those systems are poorly architected, no model alone will save the experience.