Optimizing Workflows at Scale

You can watch this talk on YouTube in Russian and English (auto-dubbed).

Nikolai Sidiropulo’s talk focused on a topic that often gets less attention than model quality or token cost: performance at scale. His central point was that as AI systems move from experimentation into production, teams eventually reach a stage where optimization is no longer optional. Once there is real business value, real user volume, and real operational dependency on these workflows, latency and workflow performance become business issues rather than just technical ones.

He framed optimization around three familiar software dimensions: reliability, efficiency, and performance. In AI systems, reliability is widely discussed because of non-determinism and the need for evals, prompt tuning, and output quality checks. Efficiency is also visible because token usage translates directly into dollars. Performance, he argued, is often under-discussed, even though in many real environments it determines whether the system is usable at all. A workflow that is technically correct but too slow can still become a bottleneck for the customer’s entire operation.

To make that point concrete, Nikolai described a healthcare-oriented workflow where a nurse uses a system during patient discharge. In this kind of environment, the user is not casually multitasking while waiting for the AI to finish in the background. The process is linear, operational, and time-sensitive: the nurse stands there waiting for the system to complete its work before moving on. In such scenarios, even an extra minute of latency compounds quickly across many users and many cases. What looks like a small delay at the system level becomes a throughput problem at the organizational level.

A major focus of the talk was instrumentation and observability. Nikolai emphasized that assumptions about performance are often wrong unless they are validated against actual production data. Local benchmarks and isolated tests rarely reflect the complexity of real-world usage. His recommendation was to instrument critical workflows, use tracing systems such as OpenTelemetry and platforms like Langfuse, and analyze data through percentiles rather than averages. In particular, he stressed the importance of tail latency, meaning P90, P95, or P99 behavior rather than just median response times, because the worst-performing slice of requests often reveals the most meaningful bottlenecks.
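To illustrate why percentiles matter more than averages, here is a minimal Python sketch using only the standard library. The function name `latency_summary` and the sample latencies are made up for illustration; they are not data from the talk, and a real setup would read durations from exported traces instead.

```python
import statistics

def latency_summary(latencies_ms):
    """Return the mean, the median, and tail percentiles for a list of latencies."""
    # quantiles(n=100) returns 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p90": cuts[89],
        "p95": cuts[94],
        "p99": cuts[98],
    }

# Hypothetical sample: 95 fast requests and 5 very slow ones.
sample = [800.0] * 95 + [12000.0] * 5
summary = latency_summary(sample)
print(summary)
```

In this made-up sample the median stays at 800 ms and the mean is only 1360 ms, while P99 is 12000 ms. The slow requests are invisible in the median and heavily diluted in the mean, which is why the tail is the place to look.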

One of the strongest practical insights from the session was the need to enrich traces with business-specific attributes, not just standard technical metadata. In the example he shared, adding contextual information such as state or region made it possible to identify asymmetric performance patterns. Some workflows were slower in certain locations because the agent was dynamically pulling extra regulatory context from external sources. Once that pattern became visible, the team could preload and cache the relevant information instead of forcing the agent to fetch it on demand. This improved latency, reduced token waste, and opened the door to better model choices at lower cost.
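The preload-and-cache idea can be sketched as follows. This is an assumption-laden illustration, not the team's implementation: the class `RegulatoryContextCache`, the `fetch` callable, the one-hour time-to-live, and the state codes are all hypothetical.

```python
import time

class RegulatoryContextCache:
    """Keeps per-state context in memory so the agent does not fetch it on demand."""

    def __init__(self, fetch, ttl_seconds=3600.0, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable: state -> context text
        self._ttl = ttl_seconds    # assumed freshness window
        self._clock = clock
        self._entries = {}         # state -> (expires_at, context)

    def preload(self, states):
        """Fetch context ahead of time for the states known to be slow."""
        for state in states:
            self._store(state)

    def get(self, state):
        """Return cached context, refetching only if it is missing or expired."""
        entry = self._entries.get(state)
        if entry is not None and entry[0] > self._clock():
            return entry[1]
        return self._store(state)

    def _store(self, state):
        context = self._fetch(state)
        self._entries[state] = (self._clock() + self._ttl, context)
        return context

# Usage with a stand-in for the slow external lookup.
calls = []

def fake_fetch(state):
    calls.append(state)
    return f"rules for {state}"

cache = RegulatoryContextCache(fake_fetch)
cache.preload(["CA", "NY"])
context = cache.get("CA")  # served from the cache; fake_fetch is not called again
```

The cached text can then be passed into the prompt directly, so the agent skips the external lookup in the request path. How long regulatory context may safely be cached is a domain decision that this sketch does not answer.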

During Q&A, the audience asked for concrete examples of bottlenecks encountered while scaling workflows. Nikolai pointed back to exactly this kind of hidden asymmetry: issues that are not obvious in aggregate metrics but become clear once traces are grouped by meaningful business dimensions such as geography, time of day, or operational load. Another audience question asked what modern AI system architecture looks like in practice. In response, he outlined a progression from simpler systems to more complex ones: first, workflows that can still be solved algorithmically; then single-model-call systems; then more structured pipelines; and finally multi-agent and sub-agent architectures. His answer was careful and pragmatic: the more moving parts a system has, the more points of failure, non-determinism, and compounded risk it introduces.
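The grouping of traces by a business dimension mentioned in the Q&A can be sketched in a few lines of Python. The trace record shape (a dict with `attributes` and `duration_ms`), the function name `p95_by_dimension`, and the sample values are assumptions for illustration; a real pipeline would query the tracing backend instead.

```python
from collections import defaultdict
import statistics

def p95_by_dimension(traces, dimension):
    """Group trace durations by one attribute and return P95 per group, slowest first."""
    groups = defaultdict(list)
    for t in traces:
        key = t["attributes"].get(dimension, "unknown")
        groups[key].append(t["duration_ms"])

    result = {}
    for key, durations in groups.items():
        if len(durations) < 2:
            # Too few points to interpolate a percentile; use the single value.
            result[key] = durations[0]
        else:
            # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
            result[key] = statistics.quantiles(durations, n=20, method="inclusive")[18]
    return dict(sorted(result.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical traces: region "B" hides a slow tail that region "A" does not have.
traces = (
    [{"attributes": {"region": "A"}, "duration_ms": 900.0} for _ in range(20)]
    + [{"attributes": {"region": "B"}, "duration_ms": 900.0} for _ in range(15)]
    + [{"attributes": {"region": "B"}, "duration_ms": 9000.0} for _ in range(5)]
)
per_region = p95_by_dimension(traces, "region")
print(per_region)
```

The same function works for any attribute attached to the trace, such as time of day or a load bucket, which is what makes enriching traces with business context worthwhile.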

There was also an interesting discussion around whether DevOps can be seen as a foundation for AI systems. Nikolai’s answer was that many existing software and infrastructure practices are absolutely relevant, but AI introduces a different layer of complexity because behavior is probabilistic rather than strictly deterministic. Observability, cost tracking, and operational discipline still matter, but standard methods do not map perfectly onto AI workloads, particularly when it comes to evaluating quality, understanding latency distributions, and tracing non-deterministic execution paths.

Overall, the talk offered a mature production-focused view of AI systems. Rather than treating optimization as a one-time tuning exercise, Nikolai described it as a recurring loop: instrument, observe, analyze, optimize, and repeat. For teams already operating AI workflows at meaningful scale, it was a strong reminder that the biggest opportunities often appear not in average performance, but in the edge cases and hidden clusters where the business feels the pain first.

Presenter's email: nasidiropulo@gmail.com