1. Executive Summary
LLM-based multi-agent systems have moved from research prototypes to production infrastructure. A well-designed agent system decomposes complex, multi-step goals into specialized sub-tasks handled by purpose-built agents, each with a focused role, bounded authority, and clear communication contract. The result is a system that can tackle tasks no single model call could accomplish reliably: long-horizon research, iterative code generation, multi-source data synthesis, and any workflow requiring memory, tool use, and adaptive decision-making.
This document defines a production-ready architecture for deploying LLM-based multi-agent systems on Google Kubernetes Engine (GKE). It covers the full engineering stack: prompt construction, context window management, the agent harness, memory systems, orchestration patterns, reliability controls, security, observability, and the framework landscape as of mid-2026. It is framework-agnostic in principle but references specific open-source frameworks — including OpenClaw, DeerFlow, LangGraph, AutoGen, CrewAI, and Nanobots as design references.
Design Goals
- Reliability over capability: A system that consistently delivers correct, verifiable outputs at 95% accuracy is more valuable than one that occasionally achieves 99% but fails unpredictably.
- Harness-first thinking: The agent harness — not the underlying model — is the binding constraint on production agent quality. Model selection matters less than how the system wraps, validates, and recovers from model calls.
- Bounded autonomy: Agents operate within explicitly declared authority levels. No agent takes consequential action without either deterministic authorization logic or human approval.
- Cost-conscious scaling: Tiered model routing, prompt caching, and context compression reduce inference costs by 60–80% compared to naïvely using frontier models for all tasks.
- Full auditability: Every agent decision — reasoning trace, retrieved context, tool calls, outputs, validation results — is logged immutably and indexed for replay.
2. The 2026 LLM Agent Paradigm
The discipline of building reliable LLM agents has passed through three distinct paradigm shifts since 2022. Understanding this evolution is essential for making the right architectural investments.
Prompt Engineering
“What should I say to the model?” Practitioners focused on instruction quality, zero-shot vs. few-shot phrasing, and chain-of-thought formatting. The model was treated as a black box to be coaxed.
Context Engineering
“What information should fill the context window?” As RAG matured, it became clear that what the model sees matters more than how instructions are phrased. Context window management became a first-class engineering discipline.
Harness Engineering
“What system should I build around the model?” The harness — execution environment, tool integrations, memory, retry logic, guardrails, output validation — is now understood as the binding constraint on agent reliability, not the model itself.
The Agent as a System
A single LLM agent is best understood not as “a model with a prompt” but as a closed-loop reasoning system composed of five interacting subsystems:
- Perception: What the agent receives — user input, tool results, retrieved memory, inter-agent messages.
- Context assembly: How the agent’s working context is constructed from perception inputs, constrained by the token budget.
- Reasoning: The LLM inference call that produces structured outputs, tool calls, or sub-task delegations.
- Action: Tool execution, state mutation, or message dispatch — always mediated by the harness, never directly by the model.
- Memory: What persists across turns and sessions — short-term context, long-term knowledge, episodic decision history.
Agent quality degrades when any of these subsystems is under-engineered. The most common failure mode is treating context assembly and harness as afterthoughts while over-investing in prompt wording.
3. Agent Design Patterns & Taxonomy
Rather than designing agents around specific domain tasks, production systems should be built from a small vocabulary of generic agent archetypes. Domain specialization is delivered through the agent’s system prompt, knowledge base, and tool set — not through bespoke code per use case.
Core Agent Archetypes
| Archetype | Primary Role | Authority Level | Typical Model Tier | Latency Class |
|---|---|---|---|---|
| Coordinator / Orchestrator | Receives goals, decomposes them into sub-tasks, routes to specialist agents, tracks overall progress | System-level | Mid-tier (Sonnet class) | Fast (routing overhead only) |
| Planner | Produces a structured execution plan from a high-level goal; selects tools, agents, and ordering | Plan only — no execution | High-tier (Opus/GPT-4 class) | Moderate (seconds) |
| Executor | Carries out atomic, well-scoped tasks assigned by Planner; uses tools; produces structured outputs | Autonomous within scope | Mid-tier (Sonnet class) | Fast to moderate |
| Critic / Verifier | Reviews outputs from Executor agents; identifies errors, hallucinations, or constraint violations | Block output on failure | Low-to-mid tier (Haiku/Sonnet) | Fast (verification pass) |
| Synthesizer | Combines outputs from multiple Executor agents into a coherent final result; resolves conflicts | Human review for high-stakes output | High-tier (Opus class for complex synthesis) | Moderate (long context) |
| Memory Consolidator | Runs on a schedule or during idle periods; compresses short-term memory into long-term vector storage | Fully autonomous | Low-tier (Haiku class) | Async / background |
| Retrieval Agent | Manages knowledge retrieval — query expansion, hybrid search, re-ranking — on behalf of other agents | Read-only | Low-tier (Haiku class) | Fast (<500ms) |
Authority Levels
Authority levels are enforced by the harness, not the prompt. A prompt-only authority declaration can be overridden by prompt injection or model drift; harness-enforced authority cannot.
- Fully Autonomous: Output is delivered or action is taken directly. Reserved for low-risk, reversible operations within a pre-authorized scope (e.g., reading data, sending a draft message for review).
- Human-in-Loop Required: The agent produces a decision and reasoning trace; a human must explicitly approve before any action is taken. Applied to all irreversible or high-impact actions.
- Advisory Only: Agent output feeds a human dashboard or downstream system but is never acted upon autonomously. The agent has no tool access that mutates state.
Coordination Patterns
| Pattern | Structure | When to Use | Trade-offs |
|---|---|---|---|
| Hierarchical (Supervisor) | Coordinator owns the plan; specialists report back. Tree-shaped execution graph. | Complex multi-step tasks; compliance-critical workflows where a clear chain of authority is required | Single point of coordination; auditable; coordinator becomes a bottleneck at high concurrency |
| Sequential Pipeline | Agent A → Agent B → Agent C. Output of each stage is input to the next. | Workflows with clear stage boundaries — research → draft → review → publish | Simple to reason about; brittle if early stages fail; no parallelism |
| Parallel Fan-out / Fan-in | Coordinator dispatches to N agents simultaneously; Synthesizer aggregates results. | Independent sub-tasks that can be executed concurrently — searching multiple sources | Fast; requires robust conflict resolution; output quality depends on Synthesizer |
| Peer-to-Peer / Debate | Agents communicate directly; disagreements surface via structured debate rounds. | Tasks benefiting from adversarial review — code review, risk assessment, essay critique | High quality from diverse perspectives; harder to audit; non-deterministic execution graph |
| Event-Driven (Reactive) | Agents subscribe to event streams; triggered by external events. Pattern used by OpenClaw. | Long-running autonomous assistants; monitoring and alerting workflows | Highly responsive; complex state management; harder to bound execution scope |
Inter-Agent Communication Contract
Regardless of coordination pattern, every inter-agent message must carry a structured envelope with: originating agent identity, correlation/trace ID, the specific question or sub-task being delegated, the accumulated decision context so far, a confidence score, and a time-to-live. The harness enforces a maximum delegation depth (typically 2–3 hops) before escalating to the Coordinator, preventing unbounded recursive delegation.
When to Use Multi-Agent (and When Not To)
Staged Evolution (Recommended)
- Stage 1 — Single agent + tools: One agent with RAG, memory, tool calling, and a workflow engine. Handles the majority of enterprise use cases. Low operational burden, easy to audit, straightforward to debug.
- Stage 2 — Add a Critic/Validator agent: When output quality concerns justify it, add a separate validation pass. This single addition captures most of the reliability benefit of multi-agent architectures at a fraction of the coordination overhead.
- Stage 3 — Add Planner + parallel Executors: Only when tasks are reliably too large for one context window, have independently parallelizable sub-tasks, or require distinct access permission scopes per sub-task.
- Stage 4 — Full multi-agent system: For large-scale autonomous workflows where Stages 1–3 are demonstrably insufficient. This document’s full architecture applies at this stage.
Decision Framework
| Situation | Recommended Architecture | Reason |
|---|---|---|
| CRUD automation, simple Q&A, data lookup | Single agent + tools | Multi-agent adds coordination overhead with no accuracy benefit; single agent with RAG is faster and cheaper |
| Customer support, triage, routing | Single agent + tools | Sequential, well-bounded tasks; one well-prompted agent with retrieval outperforms a poorly coordinated swarm |
| Document summarization, drafting, editing | Single agent (+ optional Critic) | One context window is sufficient; add Critic only if quality rejection rate is unacceptably high |
| Code generation and review | Single agent + optional Executor/Critic split | MetaGPT-style multi-agent is justified only for large-scale autonomous software engineering; not for standard code tasks |
| Multi-source research and synthesis | Multi-agent: Planner + parallel Executors + Synthesizer | Parallel retrieval across independent sources provides genuine speed and coverage benefit that a single agent cannot match |
| Long-horizon autonomous workflow (hours to days) | Multi-agent with Temporal workflow engine | Context overflow across a single agent; durability requirements; sub-tasks can run independently and asynchronously |
| Tasks requiring different data access scopes per sub-task | Multi-agent | Access permission isolation per agent is a genuine security and compliance requirement, not a design preference |
| Monitoring across many independent sources | Multi-agent with event-driven pattern (OpenClaw) | Fan-out monitoring genuinely benefits from parallel specialized agents reacting to independent event streams |
4. Prompt Engineering
Prompt engineering is the deliberate design of the instructions, constraints, and examples that constitute an agent’s behavioral specification. While the context window is broader than the prompt alone, the prompt provides the agent’s stable identity — the constitution that governs how it interprets every piece of context it receives.
System Prompt Architecture
All agent system prompts follow a four-block hierarchy with clearly marked delimiters. This structure is consistent across all agent types — only the content of each block changes per agent. User input and retrieved context are never placed in the system prompt; they arrive in separate, clearly labeled turns or context blocks, preventing the model from conflating instructions with data.
- Role block: Declares the agent’s identity, single primary responsibility, and explicit scope boundary. Instructs the agent to refuse requests outside its scope rather than attempt them.
- Rules block: Contains absolute, non-overridable constraints — behaviors that must hold regardless of any instruction in the conversation, including user input. Examples: “always cite sources for factual claims,” “never take irreversible actions without explicit approval,” “when uncertain, escalate rather than fabricate.”
- Context block: Populated at runtime by the retrieval pipeline. Contains relevant knowledge base chunks, retrieved documents, few-shot examples, and any agent-specific state. This block is dynamic and varies per request.
- Task block: Specifies the current request and the required structured output format, including schema constraints the model must satisfy.
Few-Shot Strategy
Few-shot examples dramatically improve structured output quality and constraint adherence. The key design decision is whether examples are static (hardcoded in the system prompt) or dynamic (retrieved from the knowledge base at runtime).
- Dynamic retrieval is strongly preferred for production systems. Examples are stored in the knowledge base and retrieved by semantic similarity to the current request. This allows the example pool to be updated, curated, and improved without a deployment.
- Negative examples are as important as positive ones. Show the model what a bad or out-of-scope response looks like, paired with the correct refusal or escalation.
- “I don’t know” examples are mandatory for any agent operating in a domain with bounded knowledge. The model must see examples where the correct answer is explicit escalation or acknowledgment of uncertainty — not a fabricated response.
- Format examples as structured reasoning traces (chain-of-thought), not just input/output pairs. This trains the model to show its work, which is essential for validation and audit.
Prompt Versioning & Governance
System prompts are first-class software artifacts. They must be version-controlled, reviewed, tested, and deployed with the same rigor as code.
- Prompts are stored in a Git-based registry as versioned, tagged files — reviewed via pull request by the AI platform team before merging.
- Each deployed agent references a specific prompt version. The harness rejects agent startup if the referenced prompt version is not in the approved registry.
- Changes to the rules block require a higher-level approval gate than changes to the task or context blocks.
- New prompt versions are validated against the full regression test suite before promotion to production.
- Deployed prompts are loaded via configuration at agent startup; hot-reload is supported without pod restarts for low-risk updates.
Prompt Injection Mitigation
| Attack Vector | Mitigation |
|---|---|
| Direct injection in user input (“ignore previous instructions”) | All user input is delivered in a separate, clearly labeled turn. The rules block explicitly instructs the model that user-turn content is data to be acted upon, not instructions. Jailbreak patterns are detected by embedding similarity against a known-attack library before the request reaches the LLM. |
| Indirect injection via processed documents (emails, web pages, files) | All externally retrieved content is injected in distinctly labeled source blocks. The model is instructed that content inside source blocks represents external data and cannot override system rules. |
| Role-play / persona bypass | A post-generation output validation layer checks whether the model’s response violated any declared rule. Violations trigger a discard-and-regenerate cycle (max 2 retries), after which the request is escalated. |
| Knowledge base poisoning | Knowledge base entries are stored with cryptographic integrity hashes. On retrieval, hashes are re-verified; mismatches quarantine the entry and trigger an alert. All knowledge base writes require an authenticated, audited write path — agents only have read access. |
5. Context Engineering & Window Management
Context engineering is the discipline of deciding what information to place in an agent’s context window, how much space to allocate to each category, how to compress when the budget is exceeded, and how to keep the most relevant information at the top of the window where models attend most strongly.
Research consistently shows that context quality — the relevance and density of information in the window — has a larger impact on output quality than prompt wording. Filling the window with low-relevance retrieved chunks actively degrades performance. Context engineering is therefore not a minor optimization; it is central to agent reliability.
Context Window as a Managed Resource
Every agent’s context window is divided into explicit, budget-controlled slots. The prompt assembly layer enforces these budgets before each LLM call — never relying on the model to manage its own attention.
| Agent Archetype | System + Rules | Retrieved Context | History | Output Reserve | Recommended Total Budget |
|---|---|---|---|---|---|
| Coordinator / Orchestrator | 2,000 | 1,500 | 1,500 | 1,000 | 6,000 |
| Executor (standard task) | 2,000 | 3,000 | 1,000 | 1,000 | 7,000 |
| Critic / Verifier | 1,500 | 4,000 | 500 | 500 | 6,500 |
| Synthesizer (multi-source) | 3,000 | 20,000 | 2,000 | 5,000 | 30,000 |
| Long-context analysis agent | 4,000 | 150,000 | 2,000 | 4,000 | 160,000 (Opus / Gemini 1.5) |
| Triage / routing agent | 1,000 | 500 | 500 | 250 | 2,250 |
Context Assembly Pipeline
Before each LLM call, the context assembly pipeline runs in this order:
- Query analysis: Extract intent, entities, and required knowledge domains from the incoming request.
- Parallel retrieval: Simultaneously execute sparse keyword search (BM25) and dense embedding search over the knowledge base. Run graph queries for entity relationships where applicable.
- Re-ranking: Apply Reciprocal Rank Fusion (RRF) to merge sparse and dense results. Apply a domain-specific re-ranker model if available. Select the top-K chunks by relevance score.
- History compression: If conversation history exceeds the allocated slot, run a lightweight summarization pass (using a low-cost model) to condense older turns.
- Slot assembly: Fill each slot in priority order: rules → retrieved context → compressed history → current request. Enforce byte-level token counts per slot.
- Overflow handling: If assembled context still exceeds the model’s window after compression, drop lowest-scoring retrieved chunks first, then older history. If after two rounds the window is still exceeded, reject the request and escalate — never silently truncate.
Retrieval Quality
- Embedding model selection: Choose an embedding model trained on content similar to the target domain. General-purpose embeddings perform well for broad tasks; domain-specific fine-tuned embeddings improve recall on specialized corpora.
- Chunk sizing: Chunks of 256–512 tokens generally outperform longer chunks for retrieval precision. Long documents should be chunked with sentence-boundary awareness and overlapping windows to avoid splitting reasoning context.
- Query expansion: A lightweight pre-retrieval step generates 2–3 query variants (paraphrases, related terms) and retrieves against all variants. Dramatically improves recall for ambiguous or terse user queries.
- Contextual compression: After retrieval, a compression pass extracts only the sentences from each chunk that are directly relevant to the query, reducing noise in the context window without losing key information.
Long-Running Agent Context Management
Agents that operate over extended periods (hours to days) face a unique challenge: context accumulated across thousands of turns cannot fit in any context window. Frameworks like OpenClaw address this with a Memory Dreaming pattern — during idle periods, a background consolidator model distills short-term conversation history into structured long-term memory entries, which are then retrievable via semantic search. This pattern is recommended for any agent designed to maintain coherence across sessions.
A complementary approach, used by Claude Code and similar developer agents, is a virtualized filesystem: the agent externalizes working state to structured artifacts (plan files, task lists, decision logs) that persist independently of the context window and are selectively loaded into context as needed.
6. Agent Harness Design
The agent harness is the execution layer that wraps every LLM inference call. It is responsible for context assembly, tool execution, output validation, retry logic, memory persistence, and observability instrumentation. In 2026, industry consensus has converged on a key principle: the harness is the binding constraint on agent reliability, not the underlying model.
Harness Responsibilities
| Responsibility | Description |
|---|---|
| Context Assembly | Orchestrates the retrieval, compression, and slot-filling pipeline described in Section 5. Enforces token budgets before the LLM call is made. |
| Model Routing | Selects the appropriate model tier based on task complexity, context size, authority requirements, and cost budget. Manages fallback to secondary models when primary is unavailable. |
| Tool Authorization | Validates that the requested tool call is in the agent’s declared allowlist. Checks parameter schemas. Rejects unauthorized tool calls before execution. |
| Tool Execution Sandbox | Executes tools in an isolated environment (container, WebAssembly module, or subprocess). Enforces timeouts, memory limits, and network egress restrictions. Tool results are sanitized before injection into the next context. |
| Output Validation | Validates model outputs against declared schemas and constraint rules. Runs PII detection. Computes confidence signals. Triggers re-generation or escalation on validation failure. |
| Retry & Recovery | Manages transient failure recovery: exponential backoff on API errors, fallback to secondary models on provider failure, context sanitization and restart on persistent validation failures. |
| Memory Persistence | Writes decision records, reasoning traces, and state updates to the appropriate memory tier before returning the response. |
| Observability Instrumentation | Emits structured spans for every harness operation: model call, tool execution, retrieval, validation. Tracks token counts, latency, cost, and confidence at every step. |
The Agent Reasoning Loop
The harness implements the agent’s core reasoning loop — a multi-turn cycle that runs until the model produces a final text-only response (no further tool calls):
- Assemble context from memory, retrieval pipeline, and current conversation state.
- Call the LLM with the assembled context.
- If the model returns a tool call: validate authorization → execute in sandbox → inject result → return to step 1.
- If the model returns a delegation request: emit message to target agent queue → await response (async) or block (sync, with timeout) → inject result → return to step 1.
- If the model returns a final response: run output validation → check confidence → persist to memory → deliver or escalate.
- If any step produces an unrecoverable error after retries: emit escalation event → persist failure record → return error response.
Bounded, Deterministic Workflows
Production experience across the industry has strongly validated the Supervisor Pattern over unconstrained agent swarms. Key principles:
- Phase-gating: Complex workflows are divided into explicit phases (plan → gather → draft → review → deliver). No phase begins until the previous phase’s output has been validated. This prevents cascading errors from propagating through the entire workflow.
- Bounded loops: Every reasoning loop has a hard maximum iteration count enforced by the harness, not the prompt. Unconstrained loops are the most common cause of runaway cost and infinite regress.
- Explicit termination conditions: The harness defines success and failure criteria independently of the model. A task completes when the harness validates the output satisfies the success criteria — not when the model claims it is done.
- Prefer simpler graphs: Resist the temptation to model every possible execution path. A flat, sequential pipeline with explicit branching at known decision points is more reliable and easier to debug than a fully dynamic agent graph.
Model Routing
The harness selects the most cost-efficient model that meets the task’s requirements at runtime:
| Condition | Model Tier | Rationale |
|---|---|---|
| Triage, routing, simple lookup, PII scanning | Low tier (Haiku class) | Lowest cost; fast; more than sufficient for classification and simple generation tasks |
| Standard analysis, structured output, multi-step reasoning | Mid tier (Sonnet class) | Best accuracy/cost balance; suitable for the majority of agent tasks |
| Complex synthesis, long-document analysis, high-stakes decisions | High tier (Opus / GPT-4 class) | Highest reasoning quality; use only where lower tiers demonstrably fail |
| High-volume batch workloads, cost-sensitive, non-real-time | Self-hosted open-source (Llama class via vLLM) | Near-zero per-token cost after GPU amortization; data does not leave the cluster |
Tool Integration via MCP (Model Context Protocol)
The Model Context Protocol (MCP) has become the standard abstraction for connecting agents to external tools, data sources, and capabilities. Rather than each agent implementing bespoke tool integrations, MCP-compatible servers expose capabilities through a standardized interface that any MCP-aware harness can discover and invoke. This approach is now adopted by Anthropic’s Claude, OpenAI, Hermes Agent, Nanobots, and major enterprise platforms.
- MCP servers: Each tool category (database access, file system, web search, API calls, code execution) is exposed as a separate MCP server with a declared capability manifest. Agents discover available tools from the manifest at startup — no hardcoded integrations in the harness.
- MCP governance: Server manifests declare required permissions and data access scopes. The harness validates that the requesting agent’s allowlist includes the specific MCP server and capability before establishing a connection. Unsigned or unregistered MCP servers are rejected at the harness layer.
- MCP security model: Each MCP server runs in its own sandboxed process with declared network egress, filesystem access, and API scopes — enforced by the container runtime, not by the server’s own code. Tool results returned from MCP servers are sanitized (size limits, executable content stripped) before injection into agent context. A compromised MCP server cannot affect the harness process.
- MCP in production on GKE: Host each MCP server as a separate GKE Deployment with its own service account, resource limits, and network policy. This enables independent scaling, versioning, rollback, and security policy per tool category — far more operationally manageable than bundled tool integrations.
7. Memory Architecture
Agent memory is organized into three tiers by temporal scope, with distinct storage backends and access patterns for each. A well-designed memory system lets agents maintain coherence across long sessions and leverage accumulated knowledge without ever overflowing their context window.
Memory Tiers
| Tier | Scope | Storage | TTL / Retention | What Is Stored |
|---|---|---|---|---|
| Short-Term (Working) | Current session / task | In-process state + Redis | Session lifetime | Active conversation turns, intermediate reasoning steps, tool results from this session |
| Episodic (Mid-Term) | Recent interactions | Redis (hot) + Vector DB (warm) | Days to weeks | Summarized past sessions, recent decisions with outcomes, user preferences and corrections |
| Semantic (Long-Term) | Persistent knowledge | Vector DB + Knowledge Graph | Indefinite (versioned) | Domain knowledge base, policy documents, curated few-shot examples, entity relationships |
Short-Term Memory Management
The default working context retains the most recent N turns, where N is determined by the agent’s allocated history slot in the token budget. When the history slot fills to 70% of capacity, the harness automatically triggers a summarization pass using a low-cost model, condensing older turns into a compact summary object that replaces them in context. This keeps the context window from filling with stale conversational scaffolding while preserving the essential reasoning thread.
Long-Term Memory & Memory Dreaming
Inspired by OpenClaw’s Memory Dreaming architecture, long-term memory consolidation runs as a background process during idle periods. A lightweight consolidator model reviews the agent’s recent short-term memory, extracts key facts, decisions, preferences, and outcomes, and writes structured entries to the long-term vector store. This mimics biological memory consolidation during sleep and enables agents operating over days or weeks to maintain coherence without ever exceeding their context window.
All consequential decisions — task completions, escalations, tool call outcomes, user approvals — are persisted as immutable records to an append-only decision log before the response is returned.
Storage Architecture
| Store | Technology | Purpose |
|---|---|---|
| Hot cache | Redis 7 | Active session state, embedding cache, rate limit counters. LRU eviction keeps the most-used context in memory. |
| Vector store | pgvector (PostgreSQL extension) | Semantic search over knowledge base and episodic memory. HNSW index for ANN search. Zero new infrastructure if PostgreSQL is already in the stack. |
| Knowledge graph (Optional) | Neo4j | Entity relationships, multi-hop reasoning across connected data. Add only when entity relationships are a proven core requirement and vector search alone demonstrably fails to capture them. Carries real operational cost: separate infrastructure, backup strategy, consistency model, and expertise. Many production deployments never require this layer — start without it and add only when justified by evidence. |
| Relational store | PostgreSQL | Audit log, decision history, agent registry, model registry. Append-only tables with hard deletes disabled at the database layer. |
| Cold archive | Object storage (GCS) | Raw reasoning traces, long-term data retention beyond warm-tier TTL. |
Knowledge Base Governance
Knowledge base entries are version-controlled with effective-from and effective-until markers. Superseded entries are retired (excluded from retrieval) but never deleted — they remain accessible for audit and historical context. All writes to the knowledge base require an authenticated, audited write path; agents have read-only access. Knowledge base entries are stored with content integrity hashes; retrieval re-verifies hashes before injecting content into agent context.
8. Multi-Agent Orchestration
Recommended Pattern: Hierarchical Supervisor
For the majority of enterprise agent deployments, the hierarchical supervisor pattern is the right choice. A Coordinator agent receives all inbound requests, classifies intent, and routes to the appropriate specialist agents. Specialists may sub-delegate once (at most twice) before returning results to the Coordinator. This produces a tree-shaped execution graph that is auditable, debuggable, and controllable.
Workflow Engine
Complex multi-step agent workflows (involving tool calls, human-in-loop waits, retries, and conditional branching) must be backed by a durable workflow engine — not simple async code. The workflow engine guarantees that workflows survive process crashes, network failures, and model errors by replaying from the last successful checkpoint. Apache Temporal is the recommended engine for GKE deployments; it supports:
- Durable, retryable activities with exponential backoff.
- Long-running human-approval activities that pause indefinitely until a human approval signal is received.
- Full execution history as an immutable audit trail — Temporal’s event history is the authoritative record of what each agent did and when.
- Deterministic workflow replay for debugging and investigation.
Communication Protocol
- Asynchronous (Kafka): Default for high-volume internal agent-to-agent messaging within the cluster. A dedicated topic per agent archetype, with consumer group scaling. Messages carry the full delegation envelope (Section 3). Dead-letter queues capture messages that fail processing after the maximum retry count.
- Synchronous (gRPC / HTTP): Used only when the Coordinator must block on a sub-agent result before proceeding — e.g., a fast lookup on the critical path. Hard timeout enforced; failure routes to the async fallback path.
- A2A Protocol: The interoperability layer for agent-to-agent communication across framework and vendor boundaries. See below.
Agent-to-Agent Interoperability: A2A Protocol
The Agent-to-Agent (A2A) Protocol is the Linux Foundation open standard for inter-agent communication. Originally developed by Google and donated to LF AI & Data, A2A v1.0 was released in early 2026 with over 150 supporting organizations: Google, Microsoft, AWS, Oracle, Databricks, Snowflake, Salesforce, ServiceNow, SAP, and IBM. It is natively integrated into Google ADK, Microsoft Agent Framework, LangGraph, Semantic Kernel, CrewAI, Amazon Bedrock AgentCore, and Azure AI Foundry. Financial services is a confirmed production adoption vertical.
A2A and MCP have distinct, complementary roles — one is not a replacement for the other:
| Protocol | Connects | What It Handles |
|---|---|---|
| MCP (Model Context Protocol) | Agent → Tool / Data Source | How an agent calls a database, API, file system, or code execution environment. Defined in Section 6. |
| A2A (Agent-to-Agent Protocol) | Agent → Agent | How a Coordinator delegates to a specialist agent; how agents built on different frameworks discover and interoperate with each other. |
Key A2A production properties relevant to this architecture:
- Signed Agent Cards: Each agent publishes a cryptographically signed capability manifest (Agent Card) declaring its skills, required permissions, and communication endpoint. Agents verify cards before accepting delegated tasks — preventing spoofing and capability misrepresentation.
- Cross-framework interoperability: A Coordinator built on Google ADK can delegate to a specialist agent built on LangGraph over A2A without bespoke integration code. This is significant for organizations that run multiple teams with different framework preferences, and for evolving the architecture incrementally.
- Enterprise security: A2A v1.0 includes authenticated agent discovery, enterprise-grade multi-tenancy, and VPC-compatible deployment — required properties for banking infrastructure.
- Kafka as A2A transport: A2A defines the protocol layer (message envelope, capability discovery, authentication). For high-volume internal messaging within the GKE cluster, Kafka remains the recommended transport implementation, carrying A2A-formatted messages.
Conflict Resolution
When two agents produce contradictory outputs on the same question, the Coordinator runs a synthesis step: both reasoning traces are presented to a high-tier model with the explicit instruction to identify the root cause of disagreement. If the disagreement is resolvable by evidence quality (one agent has higher-confidence retrieval), the higher-confidence output wins. If it is genuinely unresolvable, the decision is escalated to a human, and both outputs plus the synthesis analysis are presented together. Human resolutions are logged and fed back into the labeled training dataset.
State Consistency
Shared state (task status, accumulated results, entity state) lives exclusively in the authoritative data store. Agents do not cache shared state locally between decisions — they query the source of truth on each reasoning step. Optimistic concurrency control prevents race conditions when multiple agents act on the same entity simultaneously. The workflow engine checkpoints task progress so that any agent can be restarted and resume from the last known-good state.
9. Hallucination & Reliability Controls
Grounding Requirements
Any agent output that makes a factual claim must cite the source identifiers from the knowledge base that support that claim. The output validation layer rejects outputs that make factual assertions without citations before they are delivered. Agents are instructed in the rules block to use explicitly hedged language for all claims (“evidence suggests,” “based on retrieved data”) rather than asserting facts as certainties.
Confidence Scoring
A composite confidence score is computed from objective, externally verifiable signals. Any signal below its individual threshold contributes to a lower composite score; the composite falling below 0.75 triggers escalation:
- Citation coverage: Fraction of factual claims in the output supported by a retrieved knowledge base source identifier. Uncited claims directly lower this signal. This is the highest-weight signal.
- Retrieval quality: Mean similarity score of the top retrieved chunks against the query embedding. Low scores indicate the knowledge base may not cover the requested topic — the agent is reasoning from poor grounding.
- Validation pass rate: Whether the output passed the schema validator, constraint checker, and PII scanner on the first generation attempt. Re-generation required → lower score.
- Critic agreement: For decisions reviewed by a Critic agent, the degree of agreement between the Critic’s assessment and the Executor’s output. Disagreement lowers the score proportionally to its severity.
- Historical task accuracy: Rolling accuracy of this agent on the same task type over the past N runs, derived from the audit log and labeled decision corpus — not from the model’s own self-assessment.
When the composite score falls below the configured threshold (default: 0.75), the decision is automatically flagged for human escalation. Threshold values must be calibrated per task type against the labeled decision corpus; a single global threshold is insufficient for systems handling diverse task categories.
Fact Verification Pipeline
- A lightweight extraction pass (using a low-cost model) identifies all factual claims in the agent’s output.
- Each claim is checked against the knowledge base for supporting or contradicting evidence.
- Claims with no supporting evidence: confidence is downgraded; a disclaimer is added; the output is flagged for review.
- Claims directly contradicted by the knowledge base: output is rejected; the incident is logged as a hallucination event; the agent retries with additional grounding context injected.
Self-Correction Loop
After initial generation, a second verification pass checks the output for internal consistency: does the conclusion follow from the cited evidence? Do claims contradict each other? Detected inconsistencies trigger a regeneration request with the specific issues highlighted as additional context. After two failed regeneration attempts, the request is escalated — it is never silently delivered in an inconsistent state.
Human-in-Loop Triggers
- Blended confidence score below the configured threshold
- Any factual claim with no supporting knowledge base citation
- Any irreversible action (deletion, external message send, state mutation on critical data)
- Agent output flagged by the consistency verification pass
- Three consecutive decisions by the same agent with declining confidence (potential drift signal)
- Any decision in a use-case category explicitly marked “human-in-loop required” in the agent configuration
10. Security & Governance
OWASP Top 10 for LLM Applications (2025)
| Risk | Description | Mitigations |
|---|---|---|
| LLM01: Prompt Injection | Adversarial inputs in user data or retrieved content override model behavior | Delimiter isolation between instruction and data channels; jailbreak embedding detection on all user input; post-generation constraint validation; user input is data, never instructions |
| LLM02: Insecure Output Handling | Unsafe agent output is executed downstream (XSS, code injection, SQL injection) | All agent outputs are structured data validated against a schema; no dynamic code generation without sandbox execution; downstream systems treat agent output as untrusted input |
| LLM03: Training Data Poisoning | Compromised training or fine-tuning data degrades model behavior | Fine-tuning datasets curated with human review and full provenance tracking; cryptographic hash verification of training data; prefer managed model APIs over self-trained base models in production |
| LLM04: Model Denial of Service | Excessively long or complex inputs exhaust inference budget | Token budget enforcement before API calls; per-tenant and per-agent rate limits; message queue concurrency limits prevent spike amplification |
| LLM05: Supply Chain Vulnerabilities | Compromised LLM provider, dependency, or plugin | Provider SOC 2 and security certifications verified; all dependencies pinned with automated vulnerability scanning; skills/plugins require signed manifests before installation (OpenClaw ClawHub model) |
| LLM06: Sensitive Information Disclosure | Model leaks PII, credentials, or confidential training data | PII tokenization before data reaches any LLM API; output PII scanner redacts any detected sensitive data before delivery; system prompts never contain real user data |
| LLM07: Insecure Plugin Design | Agent tools expose capabilities beyond their intended scope | Explicit tool allowlist per agent archetype enforced by harness; all tool calls logged and audited; no shell execution tools in production; tool parameters validated against schema before execution |
| LLM08: Excessive Agency | Agent takes actions beyond its intended scope or authority | Authority levels enforced at the harness layer (not prompt); irreversible actions require explicit human approval; agent service accounts have minimal database and API grants |
| LLM09: Overreliance | Operators or end users blindly trust agent outputs without verification | Confidence scores and reasoning traces are always surfaced to human reviewers; UI framing emphasizes “AI-assisted, human-decided”; weekly accuracy reports to governance committee |
| LLM10: Model Theft | System prompts, fine-tuned weights, or proprietary knowledge exfiltrated | System prompts stored encrypted at rest and never exposed in API responses; RBAC on prompt registry; network egress controls on agent pods (no direct internet access without allowlist) |
Model Governance
- Model Registry: All approved model versions are tracked in a governed registry with approval timestamps, approver identity, and declared use-case scope. Agents may only invoke models listed in the registry for their specific use case.
- Rollout Strategy: New model versions follow a canary deployment: 5% traffic → 25% → 100%, with automatic rollback triggered if accuracy or confidence metrics regress more than 5% from the baseline at any stage.
- Prompt Change Control: System prompt changes require a Git pull request reviewed by the AI platform team. Merged prompt versions are tagged, immutable, and must pass the full regression suite before production promotion.
- Fine-tuning Governance: Training datasets for any fine-tuning run are reviewed for quality, bias, and provenance before use. Fine-tuned models must complete the same canary rollout as any model change.
Data Isolation & PII
- Each agent runs with its own service identity; database access is scoped to the minimum required grants for that agent’s declared tool set. Row-level security enforces tenant scoping on every query.
- PII (names, contact details, credentials, any regulated personal data) is tokenized at system ingestion before any agent ever sees it. Detokenization occurs only in the delivery layer, outside the agent reasoning path.
- All agent outputs pass through a PII scanner before delivery or logging. Any detected PII is redacted and an alert is emitted.
Audit Trail
Every agent decision produces an immutable audit record persisted to an append-only log before the response is returned. This record includes the agent’s identity, model version, a hash of the prompt used, identifiers of retrieved context chunks, the model’s full output, all tool calls made, validation results, confidence score, and delivery timestamp. Hard deletes on this log are blocked at the database layer.
11. Observability & Monitoring
Agent Health Metrics
| Metric Category | What Is Measured | Alert Threshold |
|---|---|---|
| Decision latency | P50/P95/P99 end-to-end per agent archetype and model. Realistic P95 targets vary significantly: single agent + tools (3–5s), multi-step agent workflow with retrieval (15–60s), research or synthesis workflow (1–5 min). Set per-agent budgets at design time — a single 3s target across all archetypes is not achievable for multi-agent chains and will produce constant false alerts. | P95 > per-agent configured budget × 1.5 → warn; > 3× → page |
| Error rate | Failed, rejected, or timed-out decisions as a fraction of total | >2% → warn; >5% → page |
| Token consumption | Tokens used per agent, per slot type, per model | Daily budget >80% → warn; >100% → block new requests |
| Inference cost | Cumulative spend per agent, per tenant, per model | Monthly burn rate >80% of budget → warn |
| Confidence score | Rolling mean blended confidence per agent archetype | 1-hour mean <0.75 → warn; <0.65 → page |
| Escalation rate | Fraction of decisions routed to human review | Spike >2× rolling 7-day baseline → investigate |
| Hallucination events | Decisions where the fact-verification pipeline detected a false claim | Any occurrence → immediate alert to AI governance team |
| Cache hit rate | Fraction of context assembly requests served from prompt/embedding cache | <50% hit rate on stable knowledge base → investigate retrieval configuration |
Reasoning Traces & Audit Records
The authoritative audit record for each agent decision is a structured decision artifact — not the raw reasoning trace — persisted to the append-only audit log before the response is returned:
- Agent identity + model version + prompt hash: Which agent, which model, which exact prompt version governed this decision
- Retrieved context identifiers + similarity scores: Which knowledge base sources the agent was grounded on, with retrieval quality signal
- Tool calls and results: Every tool invocation with declared parameters and structured output (never raw execution logs)
- Final output: The actual response delivered to the caller
- Validation results: Schema check, citation coverage score, PII scan result, composite confidence score, Critic assessment if applicable
- Workflow state snapshot: Phase, iteration count, delegation depth, Temporal workflow ID
Raw reasoning traces, when stored for debugging or model improvement purposes, must be written to a separate, encrypted, restricted-access log path. Access requires elevated authorization and is audited. Raw traces must never appear in general application logs, never be returned via API responses, and never be used as the sole basis for audit or compliance evidence.
Model Drift Detection
- Decision accuracy drift: Weekly automated regression against the labeled decision corpus. Alert if accuracy drops more than 5% from the rolling 30-day baseline.
- Semantic drift: Monthly embedding similarity analysis comparing the centroid of agent output embeddings this month against the 3-month rolling average. Significant divergence triggers a governance review.
- Confidence drift: Rolling 7-day mean confidence per agent monitored continuously. A sustained decline from baseline triggers investigation into potential model degradation or knowledge base staleness.
Self-Healing & Auto-Remediation
- Circuit Breakers: Per-LLM-provider circuit breaker. On consecutive timeouts, the circuit opens and traffic routes to the configured fallback model. Circuit probes every 30 seconds; closes after three consecutive successes.
- Failed Agent Recovery: After three consecutive validation failures, the agent is automatically restarted with a sanitized (empty) context. If failures continue post-restart, the agent is isolated and an incident is created.
- Context Corruption Detection: Knowledge base entries retrieved with a hash mismatch are quarantined, excluded from future retrieval, and flagged for manual review. The agent proceeds with reduced context rather than poisoned context.
- Resource Exhaustion: Vector index size and cache memory are monitored continuously. Alerts fire at 70% capacity to allow proactive scaling before exhaustion.
Tooling Stack
- Metrics: Prometheus + Grafana. Custom dashboards for per-agent cost tracking and accuracy burn-down.
- Logs & Traces: Structured JSON to Loki; distributed tracing to Tempo via OpenTelemetry. All LLM spans annotated with model identity, token counts, cost, and confidence.
- LLM-specific monitoring: Arize AI or equivalent (hallucination detection, embedding drift, decision accuracy tracking). Async export — never on the critical inference path.
- Prompt management: Git-based registry; Grafana panels show per-prompt-version accuracy and cost trends to guide promotion decisions.
12. Framework Landscape
For a regulated enterprise (large financial institution) deploying on GKE, framework selection must satisfy additional criteria beyond technical capability: active vendor support, enterprise security posture, longevity guarantees, and regulatory acceptability. The frameworks below are divided into two tiers accordingly. Communication protocols (MCP for tool access, A2A for agent-to-agent — covered in Sections 6 and 8) are framework-agnostic and should be adopted regardless of which framework tier you choose.
Enterprise-Grade Frameworks
Actively maintained, production-proven, with enterprise support and security commitments. Acceptable for regulated financial services deployments.
Google’s enterprise agent framework, GA across four languages in 2026. Workflows are defined as typed directed graphs (nodes = agents or steps; edges = transitions), providing deterministic, auditable execution. Native GKE deployment with official tutorials, Vertex AI integration, Cloud Trace observability, and VPC Service Controls out of the box. Built-in evaluation framework (evalsets, LLM-as-judge). A2A protocol support is first-class. One-command deploy to Vertex AI Agent Runtime, Cloud Run, or GKE Autopilot. The natural choice for GKE-native architectures on Google Cloud.
The production-grade successor that merges AutoGen’s orchestration capabilities with Semantic Kernel’s enterprise plugin system. GA April 2, 2026. Key capabilities: graph-based multi-agent orchestration, persistent session management, OpenTelemetry-native observability, MCP and A2A protocol support. Hosted Agents on Azure AI Foundry provide sub-100ms cold starts, zero idle cost, built-in identity, and automatic scaling. Supports Python and .NET. For organizations on the Microsoft / Azure stack or with existing Semantic Kernel investments, MAF is the forward path.
The most widely deployed production orchestration framework in the LangChain ecosystem. Workflows are modeled as stateful graphs where typed state flows between nodes, branches conditionally, and loops based on intermediate results. State checkpointing enables durable, resumable workflows. Native A2A support for agent-to-agent interoperability. LangSmith provides production observability: traces, evals, and prompt versioning. Google Cloud’s Agent Runtime can deploy LangGraph agents directly as A2A endpoints. Best for complex stateful workflows where explicit state management, conditional branching, and auditability are first-class requirements.
Enables role-based multi-agent systems defined in natural language. Agents are assigned roles, goals, and backstories; a Crew coordinates their work. The open-source edition is best for rapid prototyping. CrewAI Enterprise adds human-in-loop, persistent memory, SSO, and SLA-backed support — required for regulated production deployments. Native A2A support. Note: CrewAI Inc. is a VC-funded startup; evaluate vendor stability and enterprise support terms before committing to a banking production deployment. Prefer prototyping with CrewAI and graduating to ADK or LangGraph for production if vendor stability is a concern.
The internal agent runtime provides a multi-tenant HTTP wrapper,
an MCP-based skill system, a pluggable LLM router, and
composable YAML-defined skill packs loaded dynamically per
tenant. The LLM router selects model tier per request based on
configured rules, enabling centralized cost control and model
governance across tenants. Designed for the internal platform
where multi-tenancy, skill composability, and cost control are
primary requirements. Reference implementation:
apps/agent_runtime/. Can integrate with LangGraph
or ADK for complex workflow orchestration needs.
Design Pattern References
| Framework | Valuable Pattern to Study | Why Not for Direct Enterprise Adoption |
|---|---|---|
| OpenClaw (Peter Steinberger) | Memory Dreaming: background consolidation of short-term memory into long-term vector storage during idle periods. Valuable pattern for cross-session agent coherence. | Personal assistant focus; no enterprise support SLA; not hardened for banking compliance or security audit requirements. |
| DeerFlow (ByteDance) | Hierarchical research orchestration: primary orchestrator + parallel sub-agents + synthesis pattern for multi-source research workflows. | ByteDance vendor origin raises regulatory and data-sovereignty concerns for regulated Western financial institutions. Adopt the orchestration pattern using LangGraph or ADK instead. |
| Hermes Agent (Nous Research) | Persistent memory + RL training from trajectories: running agent as a training-data generator, producing tool-calling trajectories for fine-tuning. | Released February 2026; no enterprise support; not production-hardened. The fine-tuning trajectory pattern is worth studying; the framework itself is not enterprise-ready. |
| MetaGPT (DeepWisdom) | Role-based document-centric workflows: structured artifact production (PRDs → architecture → code → tests) through role-assigned agents. | DeepWisdom vendor origin; niche software engineering focus; not general enterprise use. The artifact-chaining pattern translates well to LangGraph or ADK implementations. |
Framework Selection Guidance
| Priority / Context | Recommended Enterprise Framework |
|---|---|
| GKE-native deployment, Google Cloud stack, multi-language teams | Google ADK — native GKE integration, Vertex AI, Cloud Trace, A2A first-class |
| Microsoft / Azure stack, .NET teams, or migrating from AutoGen / Semantic Kernel | Microsoft Agent Framework (MAF) — the supported successor; do not start new projects on AutoGen |
| Complex stateful workflows, explicit conditional branching, LangChain ecosystem | LangGraph — most widely deployed, strong A2A + observability, deployable on ADK Agent Runtime |
| Rapid prototyping, role-based agent teams | CrewAI (Enterprise edition) — validate vendor stability before production commitment |
| Cross-framework agent interoperability (any combination of the above) | A2A Protocol — framework-agnostic; enables ADK agents to delegate to LangGraph agents and vice versa |
13. Testing Strategy
Unit Testing
- Prompt constraint adherence: Assert that system prompts with adversarial user inputs do not produce outputs that violate declared rules. Use a lightweight mock LLM that returns scripted responses for specific input patterns — not a live API call.
- Context assembly: Assert correct token counts, slot allocation, overflow handling, and summarization triggers for known input sizes. These are deterministic tests that do not require LLM calls.
- Retrieval pipeline: Unit tests for embedding search, BM25 indexing, RRF fusion, and re-ranking logic with fixed fixtures.
- Output validation: Tests for the schema validator, PII scanner, confidence threshold logic, and fact-verification pipeline in isolation.
- Harness authorization: Assert that tool calls not in the agent’s allowlist are rejected before reaching the execution sandbox.
Integration Testing
- End-to-end workflow: Full agent decision flow against real backing services (vector DB, Redis, knowledge base) with a mock LLM via a local OpenAI-compatible server.
- Multi-agent coordination: Test the full delegation chain (Coordinator → Executor → Critic) with mock LLM. Assert correct message routing, delegation depth enforcement, and escalation on conflict.
- Workflow durability: Kill a worker mid-workflow; verify Temporal replays and resumes from the last checkpoint without data loss.
- PII non-leakage: Inject synthetic PII into test inputs; assert it is absent from all message queue events, log entries, and agent outputs.
Hallucination Testing
- Out-of-knowledge-base queries: Present agents with questions whose answers are not in the knowledge base. Assert output contains an explicit uncertainty signal, not a fabricated answer.
- Adversarial prompt library: 50+ curated jailbreak and injection attempts. Assert all are detected and rejected before reaching the LLM API.
- Consistency stress tests: Feed contradictory premises; assert the agent detects the contradiction and escalates rather than synthesizing a fabricated resolution.
- Production regression: Every hallucination event in production generates a new test case. Target: 100% of known failure modes covered.
Performance Testing
- Latency profiling: Measure P50/P95/P99 per agent archetype under 1×, 2×, 5×, and 10× normal load. Identify which component (inference API, retrieval, orchestration) limits throughput.
- Token efficiency: Measure actual vs. budgeted token usage per slot. Any agent consistently using more than 90% of its budget needs configuration review.
- Cache effectiveness: Measure embedding cache and prompt cache hit rates under realistic query distributions. Target >70% for stable knowledge base content.
Canary & A/B Testing
- New prompt versions are deployed to 5% of traffic. Promotion criteria: accuracy ≥95% of control, latency within 110%, cost within 105%, measured over a minimum 48-hour window.
- Model version changes: single-replica canary first. Workflow history provides the ground truth for comparison.
- Retrieval strategy A/B: BM25-only vs. hybrid search — decision quality measured over 2-week windows before committing to a change.
Evaluation Framework
LLM agent evaluation operates at four distinct layers that require separate metrics and tooling. Conflating them produces misleading quality signals — a model that scores well on benchmarks can still produce a failing agent workflow, and a passing workflow can still miss business targets.
| Layer | What Is Evaluated | Key Metrics | Tooling |
|---|---|---|---|
| Model Evaluation | Raw LLM capability independent of agent scaffolding — instruction following, structured output, reasoning quality | Domain-specific benchmark accuracy; token-level calibration; latency; cost per token | MLflow LLM Evaluate; provider eval APIs |
| Agent Evaluation | End-to-end agent behavior: tool selection accuracy, retrieval quality, citation coverage, constraint adherence, output correctness per task type | Task success rate; tool call accuracy; citation coverage; constraint adherence rate; escalation appropriateness rate | MLflow Agent Evaluate; custom judges |
| Workflow Evaluation | Multi-agent coordination quality: delegation accuracy, phase-gate outcomes, conflict resolution, durability under failures, audit trail completeness | Workflow completion rate; phase failure rate; MTTR on failures; Temporal workflow success rate; delegation depth distribution | Temporal workflow history; Grafana dashboards |
| Business Evaluation | Actual impact on business outcomes: how often does the agent produce something a human approves, acts on, or has to correct? | Human approval rate; correction rate; escalation rate vs. baseline; downstream process cycle time; cost per completed workflow including human review | Product analytics; MLflow experiment tracking; Grafana |
LLM-as-Judge
LLM-as-judge uses a separate LLM to evaluate agent output quality at scale, enabling evaluation that would be prohibitively expensive with human reviewers alone. MLflow ships built-in judges for grounding, correctness, safety, relevance, and custom guidelines. Key design rules for reliable judges in production:
- Use a stronger or equal model as judge: A weaker judge cannot reliably evaluate a stronger model’s output. Use the same tier or higher.
- Provide reference answers: Pointwise scoring against a known-good reference is significantly more reliable than comparative ranking between two outputs.
- LLM judges have well-documented biases: Verbosity bias (preferring longer outputs), self-preference bias (preferring outputs resembling their own training style), and sensitivity to prompt template wording are confirmed in research. Use deterministic rule-based metrics alongside LLM judges — not instead of them.
- Calibrate against human labels: MLflow supports human feedback loops where reviewer labels improve judge accuracy over time. Establish this calibration pipeline before relying on judges for production promotion decisions.
- LLM judges cannot detect subtle factual hallucinations: Judges share the same knowledge limitations as the models they evaluate. They are reliable for assessing structure, format, relevance, and citation presence — not for verifying factual correctness in specialized domains.
MLflow Integration
MLflow is the recommended evaluation and experiment tracking platform (30M+ monthly downloads; top-rated agent eval platform in 2026). Key integrations for this architecture:
- Log every agent run as an MLflow experiment with token counts, latency, confidence scores, tool calls, citation coverage, and validation results as tracked parameters.
- Use MLflow’s LLM Evaluate API to run batch evaluations against the labeled decision corpus before promoting new prompt or model versions to production.
- Compare prompt and model versions side-by-side in the MLflow comparison UI — accuracy, cost, latency, citation coverage — enabling data-driven promotion decisions.
- Export human reviewer labels (approve/reject/correct) back into MLflow as ground-truth feedback, closing the loop between production outcomes and offline evaluation accuracy.
- Use MLflow’s experiment tracking for A/B tests on retrieval strategies, prompt variants, and model tier assignments — the comparison must be evidence-based, not intuition-based.
Test Data & Fixtures
- Labeled decision corpus: 500+ decisions with human-approved ground-truth outcomes, maintained in a dedicated regression test store and used for the weekly accuracy regression run.
- Edge case library: Every production incident generates a synthetic test case. Retained indefinitely.
- Synthetic stress data: Programmatically generated inputs covering boundary conditions, unusual-but-valid inputs, and known failure modes.
14. Deployment & Scaling (GKE)
GKE Node Pools
| Pool | Machine Type | Purpose | Min / Max Nodes |
|---|---|---|---|
| default | n2-standard-4 | Agent harness pods, Kafka consumers, Temporal workers, API gateway | 3 / 20 |
| gpu-inference | a2-highgpu-1g (A100) | vLLM self-hosted inference for open-source models (batch workloads) | 0 / 4 (scale-to-zero) |
| memory-optimized | n2-highmem-8 | Vector index serving (pgvector), Neo4j knowledge graph, Redis | 2 / 6 |
Auto-Scaling
Standard CPU/memory HPA is insufficient for LLM agent workloads whose resource usage is driven by model inference latency rather than compute. Custom metrics-based HPA is required:
- Message queue depth per agent topic: Consumer pods scale when backlog exceeds a configured threshold, keeping delegation latency predictable under load spikes.
- P95 decision latency: Additional replicas are provisioned when latency exceeds 80% of the per-agent budget, providing headroom before SLA breach.
- GPU nodes: Scale-to-zero for batch inference pools; a ~3-minute cold start is acceptable for non-real-time workloads. Real-time agents must run on always-on compute.
Google ADK Deployment on GKE
For architectures built on Google ADK, Google provides an official
GKE deployment path with first-party tooling. ADK agents deploy to
GKE via the adk deploy command or the Vertex AI SDK,
and on deployment automatically inherit: managed infrastructure,
built-in VPC Service Controls, Cloud Trace integration, CMEK
encryption, and Vertex AI Agent Runtime management. ADK agents
deployed to GKE Autopilot can be exposed as A2A endpoints with
automatic Agent Card serving and authenticated access — enabling
cross-framework agent interoperability from day one. Refer to the
official ADK GKE tutorial (linked in References) for the full
deployment specification.
GitOps with ArgoCD
- All agent Kubernetes manifests, prompt ConfigMaps, and knowledge base update jobs are managed via ArgoCD. Infrastructure state is always derivable from the Git repository.
- Prompt updates create a new ConfigMap version; ArgoCD’s rolling update strategy delivers zero-downtime hot-reload without agent pod restarts.
- Model version changes go through a separate ArgoCD Application with a manual sync gate requiring AI governance team approval before applying to production.
Deployment Strategies
- Agent services: Blue-green deployment via ArgoCD Rollouts. New version is deployed alongside existing; traffic shifts only after health checks pass.
- LLM model versions: Canary via a feature flag in the harness model router. The flag is controlled by a ConfigMap — no code deployment required for model version changes.
- Knowledge base updates: Incremental batch jobs that append new entries and retire superseded ones. No downtime; retrieval continues against the stable existing index during the update.
15. Cost Optimization
Token Cost Levers (Highest Impact First)
| Lever | Mechanism | Typical Saving |
|---|---|---|
| Model tier routing | Route triage, classification, and simple generation tasks to low-tier models (Haiku class). Reserve mid/high tier for tasks where lower tiers demonstrably fail. | 60–80% reduction on routed workloads |
| Prompt caching | Provider-side prompt caching (Anthropic, OpenAI) reduces cost on the cached portion of the prompt (typically the stable system prompt + knowledge base context) by ~90%. | 30–60% overall, depending on system prompt size and request volume |
| Context compression | Contextual compression removes low-relevance sentences from retrieved chunks before injection. Reduces retrieved context tokens by 30–50% with minimal recall loss. | 15–30% on retrieval-heavy agents |
| Embedding cache | Cache query embeddings and retrieval results for semantically similar queries (cosine similarity above threshold). Eliminates redundant embedding API calls and database queries. | 10–25% on high-query-volume agents |
| Self-hosted inference | vLLM on GKE GPU nodes for high-volume, latency-tolerant workloads using open-source models. Near-zero per-token cost after GPU amortization. | Near 100% API cost elimination for migrated workloads |
Cost Monitoring & Governance
- Per-agent, per-model, and per-tenant cost counters updated at inference time feed real-time Grafana dashboards.
- Budget burn-rate alerts fire when projected monthly spend exceeds 80% of budget, leaving time to adjust routing rules before overrun.
- Weekly cost efficiency reports compare actual token usage against budget per agent and flag agents where actual usage consistently exceeds 90% of budget.
- Model routing rules are reviewed quarterly against current provider pricing and workload accuracy data to re-optimize the tier assignments.
16. Architecture Diagrams
System Architecture Overview
flowchart TB
subgraph Clients["Client Layer"]
UI["Web / Mobile UI"]
API["API Consumers"]
MSG["Messaging Channels\n(WhatsApp · Slack · Telegram)"]
end
subgraph GW["API Gateway (GKE)"]
APIGW["Nginx / FastAPI Gateway\nRate Limiting · Auth · PII Tokenization"]
end
subgraph Orchestration["Orchestration Layer (GKE)"]
COORD["Coordinator Agent\nIntent Classification · Routing"]
TEMPORAL["Apache Temporal\nDurable Workflow Engine"]
end
subgraph Harness["Agent Harness (GKE)"]
CTX["Context Assembly\nRetrieval · Compression · Slot Filling"]
EXEC["Tool Execution Sandbox\nSchema Validation · Authorization"]
VALID["Output Validation\nConstraint Check · PII Scan · Confidence"]
end
subgraph Agents["Specialist Agent Pool (GKE)"]
PLANNER["Planner Agent\nHigh-tier model"]
EXECUTOR["Executor Agents\nMid-tier model"]
CRITIC["Critic / Verifier\nLow-to-mid-tier"]
SYNTH["Synthesizer\nHigh-tier model"]
MEMORY_CONS["Memory Consolidator\nLow-tier · Background"]
end
subgraph Memory["Memory and Knowledge (GKE)"]
REDIS["Redis\nHot Cache · Sessions"]
PG["PostgreSQL + pgvector\nVector Search · Audit Log"]
NEO4J["Neo4j\nKnowledge Graph"]
GCS["Object Storage\nCold Archive"]
end
subgraph LLM["LLM Providers"]
ANTHROPIC["Anthropic API\nHaiku · Sonnet · Opus"]
OPENAI["OpenAI API\nGPT-4o"]
VLLM["vLLM (GKE GPU Pool)\nOpen-Source Models"]
end
subgraph Obs["Observability"]
PROM["Prometheus + Grafana"]
LOKI["Loki — Reasoning Traces"]
ARIZE["Arize AI — Drift Detection"]
TEMPO["Tempo — Distributed Tracing"]
end
KAFKA["Apache Kafka\nAgent Messaging · DLQ"]
Clients --> GW --> COORD
COORD <--> TEMPORAL
COORD --> KAFKA --> Agents
Agents --> Harness
Harness --> Memory
Harness --> LLM
Agents --> Obs
COORD --> Obs
Agent Decision Flow (Single Agent)
flowchart LR
INPUT["Inbound Request"] --> SANITIZE
subgraph Prep["Harness — Context Assembly"]
SANITIZE["Input Sanitization\nPII tokenization · Injection detection"]
RETRIEVE["Retrieval Pipeline\nBM25 + Dense embedding + RRF re-ranking"]
COMPRESS["Contextual Compression\nRemove low-relevance sentences"]
ASSEMBLE["Slot Assembly\nEnforce token budgets per slot"]
end
SANITIZE --> RETRIEVE --> COMPRESS --> ASSEMBLE
subgraph Inference["LLM Inference"]
CACHE["Cache Check\nPrompt cache + Embedding cache"]
LLM["LLM API Call\nmodel selected by router"]
CACHE -- "cache miss" --> LLM
CACHE -- "cache hit" --> VALID
end
ASSEMBLE --> CACHE
subgraph ToolLoop["Tool Execution Loop"]
TOOL_AUTH["Tool Authorization\nAllowlist check · Schema validate"]
TOOL_EXEC["Sandboxed Execution\nResult injected then return to LLM"]
end
LLM -- "tool call" --> TOOL_AUTH --> TOOL_EXEC --> LLM
subgraph Validation["Output Validation"]
VALID["Schema + Constraint Check"]
FACT["Fact Verification\nClaims vs. knowledge base"]
PII["PII Scanner"]
CONSISTENCY["Consistency Verification\n2nd-pass lightweight model"]
CONFIDENCE["Confidence Threshold"]
end
LLM -- "final response" --> VALID --> FACT --> PII --> CONSISTENCY --> CONFIDENCE
CONFIDENCE -- "pass" --> AUDIT["Append-Only Audit Record"]
CONFIDENCE -- "below threshold" --> HUMAN["Human Review Queue\nTemporal long-running activity"]
AUDIT --> RESPONSE["Deliver Response"]
HUMAN --> RESPONSE
Multi-Agent Collaboration: Planner to Executors to Critic to Synthesizer
sequenceDiagram
participant USER as User / Upstream System
participant COORD as Coordinator
participant PLANNER as Planner Agent
participant KAFKA as Kafka
participant EX1 as Executor A
participant EX2 as Executor B
participant CRITIC as Critic / Verifier
participant SYNTH as Synthesizer
participant TEMPORAL as Temporal Workflow
USER->>COORD: Submit complex goal
COORD->>TEMPORAL: Start durable workflow
TEMPORAL->>KAFKA: Dispatch to Planner queue
KAFKA->>PLANNER: Consume goal
PLANNER->>PLANNER: Decompose into sub-tasks and select tools
PLANNER-->>KAFKA: Publish execution plan
KAFKA-->>TEMPORAL: Consume plan
par Parallel execution
TEMPORAL->>KAFKA: Dispatch sub-task A to Executor A queue
KAFKA->>EX1: Consume sub-task A
EX1->>EX1: Retrieve context, LLM inference, tool calls
EX1-->>KAFKA: Publish result A
and
TEMPORAL->>KAFKA: Dispatch sub-task B to Executor B queue
KAFKA->>EX2: Consume sub-task B
EX2->>EX2: Retrieve context, LLM inference, tool calls
EX2-->>KAFKA: Publish result B
end
KAFKA-->>TEMPORAL: Both results received
TEMPORAL->>KAFKA: Dispatch to Critic queue
KAFKA->>CRITIC: Consume results A and B
CRITIC->>CRITIC: Verify claims, check consistency, score confidence
CRITIC-->>KAFKA: Verification report pass or issues found
KAFKA-->>TEMPORAL: Consume verification
alt Issues found
TEMPORAL->>KAFKA: Re-dispatch failing sub-task with critic feedback
end
TEMPORAL->>KAFKA: Dispatch to Synthesizer queue
KAFKA->>SYNTH: Consume verified results
SYNTH->>SYNTH: Combine, resolve conflicts, generate final output
SYNTH-->>USER: Final response plus reasoning trace plus audit trail ID
Memory Retrieval Interaction
sequenceDiagram
participant AGENT as Agent (any archetype)
participant REDIS as Redis (Hot Cache)
participant PG as PostgreSQL + pgvector
participant NEO4J as Neo4j (Knowledge Graph)
participant ASSEMBLER as Context Assembler
AGENT->>REDIS: Check embedding cache (query hash)
alt Cache hit (TTL valid)
REDIS-->>AGENT: Cached chunks
else Cache miss
AGENT->>PG: Dense embedding search (top-10 chunks)
AGENT->>PG: BM25 keyword search (top-10 chunks)
AGENT->>PG: RRF fusion to top-5 ranked chunks
PG-->>AGENT: Retrieved chunks + source identifiers
AGENT->>NEO4J: Graph query (entity relationships, 2-hop)
NEO4J-->>AGENT: Related entity context
AGENT->>PG: Query expansion variants (2-3 paraphrases)
PG-->>AGENT: Additional chunks (merged and de-duplicated)
AGENT->>REDIS: Cache merged result (configurable TTL)
end
AGENT->>PG: Fetch recent episodic decisions (last N entries)
PG-->>AGENT: Decision history (summarized if needed)
AGENT->>ASSEMBLER: Combine rules, retrieved chunks, graph context, history, request
ASSEMBLER->>ASSEMBLER: Enforce slot budgets and apply contextual compression
ASSEMBLER-->>AGENT: Final prompt within context window
Context Window Budget (Long-Context Synthesizer Agent)
pie title Context Budget — Synthesizer Agent (30,000 tokens)
"System Prompt + Rules" : 3000
"Retrieved Context (multi-source)" : 20000
"Conversation History" : 2000
"Output Reservation" : 5000
Security & Audit Data Flow
flowchart LR
subgraph Ingress["Ingress"]
INPUT["Raw User Input / External Data"]
PII_TOK["PII Tokenizer\nvault — agents see tokens only"]
end
subgraph AgentExec["Agent Execution"]
HARNESS["Agent Harness\nauthorization · sandboxing · retry"]
LLM_CALL["LLM API Call\ntokens, never raw PII"]
TOOL["Tool Execution\nsandboxed · allowlisted"]
end
subgraph OutputProc["Output Processing"]
PII_SCAN["PII Scanner\nredact any leaked sensitive data"]
SCHEMA_VAL["Schema + Constraint Validator"]
AUDIT_WRITE["Append-Only Audit Log\nimmutable — deletes blocked"]
end
subgraph Delivery["Delivery and Observability"]
CLIENT["Client / Downstream System"]
LOKI_LOG["Loki — Reasoning Trace"]
ARIZE["Arize AI — Async Export"]
end
INPUT --> PII_TOK --> HARNESS --> LLM_CALL --> TOOL --> HARNESS
HARNESS --> PII_SCAN --> SCHEMA_VAL --> AUDIT_WRITE
AUDIT_WRITE --> CLIENT
AUDIT_WRITE --> LOKI_LOG
AUDIT_WRITE --> ARIZE
Appendix A: Glossary
| Term | Definition |
|---|---|
| Agent Harness | The execution layer wrapping every LLM call: context assembly, tool authorization and sandboxing, output validation, retry logic, memory persistence, and observability instrumentation. In 2026, the harness is understood as the primary determinant of agent reliability — more impactful than model selection alone. |
| Context Engineering | The discipline of deciding what information to place in an agent's context window, how much space to allocate to each information type, and how to compress when the budget is exceeded. Recognized as a distinct and more impactful practice than prompt engineering alone. |
| Harness Engineering | The 2026 paradigm encompassing the design of the entire agent execution system — prompt design, context management, tool integration, memory architecture, guardrails, and observability — rather than focusing on prompt wording in isolation. |
| RAG | Retrieval-Augmented Generation. The pattern where relevant documents are retrieved from a knowledge base and injected into the LLM prompt before generation, grounding the model's output in verifiable sources. |
| CoT (Chain-of-Thought) | A prompting technique instructing the model to reason step-by-step before producing a final answer. Improves accuracy on complex multi-step tasks and produces auditable reasoning traces. |
| Hallucination | When an LLM generates plausible-sounding but factually incorrect or fabricated content. The primary reliability risk in production agent systems. |
| Grounding | Constraining LLM outputs to claims verifiable against a known-good knowledge source. The primary mitigation for hallucination. |
| Prompt Injection | An attack where adversarial content in user input or processed documents attempts to override the model's system instructions and cause it to take unintended actions. |
| Memory Dreaming | A pattern originated by OpenClaw where a background consolidator model distills recent short-term memory into long-term vector storage during idle periods, enabling agents to maintain coherence over extended operation without context overflow. |
| Phase-Gating | A workflow control pattern where a complex task is divided into explicit phases (plan, gather, draft, review, deliver). No phase begins until the previous phase's output has been validated. Prevents early-stage errors from propagating through the entire workflow. |
| RRF (Reciprocal Rank Fusion) | An algorithm for combining ranked result lists from multiple retrieval systems (e.g., BM25 keyword search + dense embedding search) into a single unified ranking. Standard approach for hybrid retrieval. |
| pgvector | A PostgreSQL extension that adds vector data types and approximate nearest-neighbor search (ANN) indices for semantic similarity queries. Enables vector retrieval without a separate dedicated vector database. |
| HNSW | Hierarchical Navigable Small World — a graph-based ANN index algorithm offering fast approximate nearest-neighbor search with high recall. The recommended index type for pgvector production deployments. |
| MCP (Model Context Protocol) | An open protocol for connecting LLMs to external tools, data sources, and capabilities in a standardized interface. MCP servers expose tool capabilities via a declared manifest; MCP-aware harnesses discover and invoke them without bespoke integrations. Adopted by Anthropic (Claude), OpenAI, Google ADK, Microsoft Agent Framework, and major enterprise platforms. |
| A2A (Agent-to-Agent Protocol) | Linux Foundation open standard for inter-agent communication across frameworks and vendors. 150+ supporting organizations (Google, Microsoft, AWS, Oracle, Databricks, Snowflake, Salesforce, IBM, SAP). v1.0 released 2026. Integrated into Google ADK, Microsoft Agent Framework, LangGraph, Semantic Kernel, Amazon Bedrock AgentCore, and Azure AI Foundry. Complements MCP: MCP connects agents to tools; A2A connects agents to agents. Financial services is a confirmed production adoption vertical. |
| Google ADK | Google's Agent Development Kit — open-source enterprise agent framework (Apache 2.0), GA 2026 in Python, TypeScript, Go, and Java. Defines workflows as typed directed graphs; native GKE deployment; Vertex AI integration; built-in eval framework; A2A first-class. The recommended framework for GKE-native agent deployments on Google Cloud. |
| Microsoft Agent Framework (MAF) | The production GA successor (April 2026) that merges AutoGen and Semantic Kernel. Supports Python and .NET, graph-based orchestration, session management, MCP + A2A, and hosted deployment on Azure AI Foundry. AutoGen is in maintenance mode — new projects should use MAF. |
| LLM-as-Judge | An evaluation pattern where a separate LLM evaluates the quality of another agent's output. Enables scalable evaluation beyond what human review alone can cover. Has documented biases (verbosity preference, self-preference) and cannot detect subtle factual hallucinations — must be combined with rule-based metrics and calibrated against human labels. |
| MLflow | Open-source AI engineering platform (30M+ monthly downloads) for experiment tracking, model evaluation, and agent evaluation. Recommended platform for logging agent runs, running batch evals against labeled corpora, and comparing prompt/model versions before production promotion. |
| vLLM | A high-throughput open-source LLM inference server using PagedAttention to maximize GPU utilization. Used for self-hosting open-source models on GKE GPU node pools. |
| Apache Temporal | An open-source durable workflow engine. Guarantees exactly-once execution of workflow steps even through process crashes. Recommended for all production multi-step agent workflows requiring durability and audit trails. |
Appendix B: Model Comparison
| Model | Provider | Context Window | Relative Cost | Strengths | Weaknesses | Recommended Agent Role |
|---|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 200K | $$$$ | Highest reasoning quality; exceptional instruction following; very long context; best for complex synthesis | Highest cost and latency | Synthesizer, Planner (complex), long-document analysis |
| Claude Sonnet 4.6 | Anthropic | 200K | $$ | Best accuracy/cost balance; strong chain-of-thought; reliable structured output | Slower than Haiku | Coordinator, Executor (standard), most agent archetypes |
| Claude Haiku 4.5 | Anthropic | 200K | $ | Fastest response; lowest cost; suitable for high-volume triage and classification | Lower reasoning depth on complex tasks | Triage, routing, PII scanning, memory consolidation, Critic (simple checks) |
| GPT-4o | OpenAI | 128K | $$$ | Reliable; strong multi-modal reasoning; widely vetted in production | Higher cost than Sonnet; shorter context than Claude | Executor (when provider diversity is required), Planner |
| Gemini 1.5 Pro | Google Vertex AI | 1M | $$ | Largest context window available; GKE-native (low egress cost); competitive pricing | Less battle-tested for constrained structured output | Long-document analysis, multi-document Synthesizer, fallback |
| Llama 3.1 / 3.2 70B+ | Self-hosted (vLLM) | 128K | ~ (GPU amortized) | Near-zero per-token cost; data never leaves the cluster; fine-tunable; no egress | Requires GPU infrastructure ops; lower out-of-box accuracy on complex tasks | High-volume batch Executors; domain-specific tasks after fine-tuning |
Appendix C: References & Further Reading
- OWASP Top 10 for Large Language Model Applications (2025 Edition) — owasp.org/www-project-top-10-for-large-language-model-applications
- Google Agent Development Kit (ADK) — GKE Deployment Guide — cloud.google.com/kubernetes-engine/docs/tutorials/agentic-adk-vertex
- Google Agent Development Kit Documentation — adk.dev
- Microsoft Agent Framework 1.0 GA Announcement — devblogs.microsoft.com/agent-framework
- Microsoft Agent Framework Overview — learn.microsoft.com/en-us/agent-framework/overview
- A2A Protocol v1.0 — Linux Foundation AI & Data — a2a-protocol.org
- A2A Protocol: One Year Milestones (150+ Organizations) — linuxfoundation.org, April 2026
- LangGraph Production Patterns — langchain-ai.github.io/langgraph
- Agent Harness Engineering: The Rise of the AI Control Plane — Adnan Masood, Medium, April 2026
- From Prompts to Harnesses — Four Years of AI Agentic Patterns — bits-bytes-nn.github.io, April 2026
- Agent Harness for Large Language Model Agents: A Survey — Preprints.org, April 2026
- Lewis et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020.
- Ji et al. (2023). "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys.
- Yao et al. (2023). "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023.
- Apache Temporal Workflow Engine Documentation — docs.temporal.io
- pgvector: Open-source vector similarity search for PostgreSQL — github.com/pgvector/pgvector
- NIST AI Risk Management Framework (AI RMF 1.0) — nist.gov/artificial-intelligence
- Anthropic Responsible Scaling Policy — anthropic.com/rsp
- vLLM: Easy, Fast, and Cheap LLM Serving — github.com/vllm-project/vllm
