LLM-Based Agentic Architecture

Production-Ready Design for Autonomous Agent Systems
Platform: GKE Audience: Architects · Engineering Leads · AI Platform Teams Classification: Internal Version: 2.0 · June 2026

1. Executive Summary

LLM-based multi-agent systems have moved from research prototypes to production infrastructure. A well-designed agent system decomposes complex, multi-step goals into specialized sub-tasks handled by purpose-built agents, each with a focused role, bounded authority, and clear communication contract. The result is a system that can tackle tasks no single model call could accomplish reliably: long-horizon research, iterative code generation, multi-source data synthesis, and any workflow requiring memory, tool use, and adaptive decision-making.

This document defines a production-ready architecture for deploying LLM-based multi-agent systems on Google Kubernetes Engine (GKE). It covers the full engineering stack: prompt construction, context window management, the agent harness, memory systems, orchestration patterns, reliability controls, security, observability, and the framework landscape as of mid-2026. It is framework-agnostic in principle but references specific open-source frameworks — including OpenClaw, DeerFlow, LangGraph, AutoGen, CrewAI, and Nanobots as design references.

Design Goals

  • Reliability over capability: A system that consistently delivers correct, verifiable outputs at 95% accuracy is more valuable than one that occasionally achieves 99% but fails unpredictably.
  • Harness-first thinking: The agent harness — not the underlying model — is the binding constraint on production agent quality. Model selection matters less than how the system wraps, validates, and recovers from model calls.
  • Bounded autonomy: Agents operate within explicitly declared authority levels. No agent takes consequential action without either deterministic authorization logic or human approval.
  • Cost-conscious scaling: Tiered model routing, prompt caching, and context compression reduce inference costs by 60–80% compared to naïvely using frontier models for all tasks.
  • Full auditability: Every agent decision — reasoning trace, retrieved context, tool calls, outputs, validation results — is logged immutably and indexed for replay.

2. The 2026 LLM Agent Paradigm

The discipline of building reliable LLM agents has passed through three distinct paradigm shifts since 2022. Understanding this evolution is essential for making the right architectural investments.

2022 – 2024

Prompt Engineering

“What should I say to the model?” Practitioners focused on instruction quality, zero-shot vs. few-shot phrasing, and chain-of-thought formatting. The model was treated as a black box to be coaxed.

2024 – 2025

Context Engineering

“What information should fill the context window?” As RAG matured, it became clear that what the model sees matters more than how instructions are phrased. Context window management became a first-class engineering discipline.

2025 – Present

Harness Engineering

“What system should I build around the model?” The harness — execution environment, tool integrations, memory, retry logic, guardrails, output validation — is now understood as the binding constraint on agent reliability, not the model itself.

Design Principle All three disciplines remain relevant and are additive. A production agent system requires excellence in prompt construction and context assembly and harness design. Investing only in prompts while neglecting the harness produces agents that fail unpredictably at scale.

The Agent as a System

A single LLM agent is best understood not as “a model with a prompt” but as a closed-loop reasoning system composed of five interacting subsystems:

  • Perception: What the agent receives — user input, tool results, retrieved memory, inter-agent messages.
  • Context assembly: How the agent’s working context is constructed from perception inputs, constrained by the token budget.
  • Reasoning: The LLM inference call that produces structured outputs, tool calls, or sub-task delegations.
  • Action: Tool execution, state mutation, or message dispatch — always mediated by the harness, never directly by the model.
  • Memory: What persists across turns and sessions — short-term context, long-term knowledge, episodic decision history.

Agent quality degrades when any of these subsystems is under-engineered. The most common failure mode is treating context assembly and harness as afterthoughts while over-investing in prompt wording.

3. Agent Design Patterns & Taxonomy

Rather than designing agents around specific domain tasks, production systems should be built from a small vocabulary of generic agent archetypes. Domain specialization is delivered through the agent’s system prompt, knowledge base, and tool set — not through bespoke code per use case.

Core Agent Archetypes

Archetype Primary Role Authority Level Typical Model Tier Latency Class
Coordinator / Orchestrator Receives goals, decomposes them into sub-tasks, routes to specialist agents, tracks overall progress System-level Mid-tier (Sonnet class) Fast (routing overhead only)
Planner Produces a structured execution plan from a high-level goal; selects tools, agents, and ordering Plan only — no execution High-tier (Opus/GPT-4 class) Moderate (seconds)
Executor Carries out atomic, well-scoped tasks assigned by Planner; uses tools; produces structured outputs Autonomous within scope Mid-tier (Sonnet class) Fast to moderate
Critic / Verifier Reviews outputs from Executor agents; identifies errors, hallucinations, or constraint violations Block output on failure Low-to-mid tier (Haiku/Sonnet) Fast (verification pass)
Synthesizer Combines outputs from multiple Executor agents into a coherent final result; resolves conflicts Human review for high-stakes output High-tier (Opus class for complex synthesis) Moderate (long context)
Memory Consolidator Runs on a schedule or during idle periods; compresses short-term memory into long-term vector storage Fully autonomous Low-tier (Haiku class) Async / background
Retrieval Agent Manages knowledge retrieval — query expansion, hybrid search, re-ranking — on behalf of other agents Read-only Low-tier (Haiku class) Fast (<500ms)

Authority Levels

Authority levels are enforced by the harness, not the prompt. A prompt-only authority declaration can be overridden by prompt injection or model drift; harness-enforced authority cannot.

  • Fully Autonomous: Output is delivered or action is taken directly. Reserved for low-risk, reversible operations within a pre-authorized scope (e.g., reading data, sending a draft message for review).
  • Human-in-Loop Required: The agent produces a decision and reasoning trace; a human must explicitly approve before any action is taken. Applied to all irreversible or high-impact actions.
  • Advisory Only: Agent output feeds a human dashboard or downstream system but is never acted upon autonomously. The agent has no tool access that mutates state.

Coordination Patterns

Pattern Structure When to Use Trade-offs
Hierarchical (Supervisor) Coordinator owns the plan; specialists report back. Tree-shaped execution graph. Complex multi-step tasks; compliance-critical workflows where a clear chain of authority is required Single point of coordination; auditable; coordinator becomes a bottleneck at high concurrency
Sequential Pipeline Agent A → Agent B → Agent C. Output of each stage is input to the next. Workflows with clear stage boundaries — research → draft → review → publish Simple to reason about; brittle if early stages fail; no parallelism
Parallel Fan-out / Fan-in Coordinator dispatches to N agents simultaneously; Synthesizer aggregates results. Independent sub-tasks that can be executed concurrently — searching multiple sources Fast; requires robust conflict resolution; output quality depends on Synthesizer
Peer-to-Peer / Debate Agents communicate directly; disagreements surface via structured debate rounds. Tasks benefiting from adversarial review — code review, risk assessment, essay critique High quality from diverse perspectives; harder to audit; non-deterministic execution graph
Event-Driven (Reactive) Agents subscribe to event streams; triggered by external events. Pattern used by OpenClaw. Long-running autonomous assistants; monitoring and alerting workflows Highly responsive; complex state management; harder to bound execution scope

Inter-Agent Communication Contract

Regardless of coordination pattern, every inter-agent message must carry a structured envelope with: originating agent identity, correlation/trace ID, the specific question or sub-task being delegated, the accumulated decision context so far, a confidence score, and a time-to-live. The harness enforces a maximum delegation depth (typically 2–3 hops) before escalating to the Coordinator, preventing unbounded recursive delegation.

When to Use Multi-Agent (and When Not To)

Start Simple — Add Agents Only When Justified Multi-agent architectures introduce real costs: coordination latency, information loss at hand-off boundaries, more complex failure modes, and significantly higher operational burden. Recent research demonstrates single agents with sufficient compute frequently match multi-agent systems on many reasoning tasks. Treat multi-agent as an optimization you add when a single-agent approach demonstrably fails — not as the default starting point.

Staged Evolution (Recommended)

  1. Stage 1 — Single agent + tools: One agent with RAG, memory, tool calling, and a workflow engine. Handles the majority of enterprise use cases. Low operational burden, easy to audit, straightforward to debug.
  2. Stage 2 — Add a Critic/Validator agent: When output quality concerns justify it, add a separate validation pass. This single addition captures most of the reliability benefit of multi-agent architectures at a fraction of the coordination overhead.
  3. Stage 3 — Add Planner + parallel Executors: Only when tasks are reliably too large for one context window, have independently parallelizable sub-tasks, or require distinct access permission scopes per sub-task.
  4. Stage 4 — Full multi-agent system: For large-scale autonomous workflows where Stages 1–3 are demonstrably insufficient. This document’s full architecture applies at this stage.

Decision Framework

Situation Recommended Architecture Reason
CRUD automation, simple Q&A, data lookup Single agent + tools Multi-agent adds coordination overhead with no accuracy benefit; single agent with RAG is faster and cheaper
Customer support, triage, routing Single agent + tools Sequential, well-bounded tasks; one well-prompted agent with retrieval outperforms a poorly coordinated swarm
Document summarization, drafting, editing Single agent (+ optional Critic) One context window is sufficient; add Critic only if quality rejection rate is unacceptably high
Code generation and review Single agent + optional Executor/Critic split MetaGPT-style multi-agent is justified only for large-scale autonomous software engineering; not for standard code tasks
Multi-source research and synthesis Multi-agent: Planner + parallel Executors + Synthesizer Parallel retrieval across independent sources provides genuine speed and coverage benefit that a single agent cannot match
Long-horizon autonomous workflow (hours to days) Multi-agent with Temporal workflow engine Context overflow across a single agent; durability requirements; sub-tasks can run independently and asynchronously
Tasks requiring different data access scopes per sub-task Multi-agent Access permission isolation per agent is a genuine security and compliance requirement, not a design preference
Monitoring across many independent sources Multi-agent with event-driven pattern (OpenClaw) Fan-out monitoring genuinely benefits from parallel specialized agents reacting to independent event streams

4. Prompt Engineering

Prompt engineering is the deliberate design of the instructions, constraints, and examples that constitute an agent’s behavioral specification. While the context window is broader than the prompt alone, the prompt provides the agent’s stable identity — the constitution that governs how it interprets every piece of context it receives.

System Prompt Architecture

All agent system prompts follow a four-block hierarchy with clearly marked delimiters. This structure is consistent across all agent types — only the content of each block changes per agent. User input and retrieved context are never placed in the system prompt; they arrive in separate, clearly labeled turns or context blocks, preventing the model from conflating instructions with data.

  • Role block: Declares the agent’s identity, single primary responsibility, and explicit scope boundary. Instructs the agent to refuse requests outside its scope rather than attempt them.
  • Rules block: Contains absolute, non-overridable constraints — behaviors that must hold regardless of any instruction in the conversation, including user input. Examples: “always cite sources for factual claims,” “never take irreversible actions without explicit approval,” “when uncertain, escalate rather than fabricate.”
  • Context block: Populated at runtime by the retrieval pipeline. Contains relevant knowledge base chunks, retrieved documents, few-shot examples, and any agent-specific state. This block is dynamic and varies per request.
  • Task block: Specifies the current request and the required structured output format, including schema constraints the model must satisfy.
Instruction Hierarchy Models follow instructions in priority order: system prompt rules block > system prompt role block > retrieved context > user input. Any content that arrives via the context block or user turn cannot override the rules block. This hierarchy must be explicitly stated in the rules block itself, since models trained on general data may otherwise treat all turns as equally authoritative.

Few-Shot Strategy

Few-shot examples dramatically improve structured output quality and constraint adherence. The key design decision is whether examples are static (hardcoded in the system prompt) or dynamic (retrieved from the knowledge base at runtime).

  • Dynamic retrieval is strongly preferred for production systems. Examples are stored in the knowledge base and retrieved by semantic similarity to the current request. This allows the example pool to be updated, curated, and improved without a deployment.
  • Negative examples are as important as positive ones. Show the model what a bad or out-of-scope response looks like, paired with the correct refusal or escalation.
  • “I don’t know” examples are mandatory for any agent operating in a domain with bounded knowledge. The model must see examples where the correct answer is explicit escalation or acknowledgment of uncertainty — not a fabricated response.
  • Format examples as structured reasoning traces (chain-of-thought), not just input/output pairs. This trains the model to show its work, which is essential for validation and audit.

Prompt Versioning & Governance

System prompts are first-class software artifacts. They must be version-controlled, reviewed, tested, and deployed with the same rigor as code.

  • Prompts are stored in a Git-based registry as versioned, tagged files — reviewed via pull request by the AI platform team before merging.
  • Each deployed agent references a specific prompt version. The harness rejects agent startup if the referenced prompt version is not in the approved registry.
  • Changes to the rules block require a higher-level approval gate than changes to the task or context blocks.
  • New prompt versions are validated against the full regression test suite before promotion to production.
  • Deployed prompts are loaded via configuration at agent startup; hot-reload is supported without pod restarts for low-risk updates.

Prompt Injection Mitigation

Attack Vector Mitigation
Direct injection in user input (“ignore previous instructions”) All user input is delivered in a separate, clearly labeled turn. The rules block explicitly instructs the model that user-turn content is data to be acted upon, not instructions. Jailbreak patterns are detected by embedding similarity against a known-attack library before the request reaches the LLM.
Indirect injection via processed documents (emails, web pages, files) All externally retrieved content is injected in distinctly labeled source blocks. The model is instructed that content inside source blocks represents external data and cannot override system rules.
Role-play / persona bypass A post-generation output validation layer checks whether the model’s response violated any declared rule. Violations trigger a discard-and-regenerate cycle (max 2 retries), after which the request is escalated.
Knowledge base poisoning Knowledge base entries are stored with cryptographic integrity hashes. On retrieval, hashes are re-verified; mismatches quarantine the entry and trigger an alert. All knowledge base writes require an authenticated, audited write path — agents only have read access.

5. Context Engineering & Window Management

Context engineering is the discipline of deciding what information to place in an agent’s context window, how much space to allocate to each category, how to compress when the budget is exceeded, and how to keep the most relevant information at the top of the window where models attend most strongly.

Research consistently shows that context quality — the relevance and density of information in the window — has a larger impact on output quality than prompt wording. Filling the window with low-relevance retrieved chunks actively degrades performance. Context engineering is therefore not a minor optimization; it is central to agent reliability.

Context Window as a Managed Resource

Every agent’s context window is divided into explicit, budget-controlled slots. The prompt assembly layer enforces these budgets before each LLM call — never relying on the model to manage its own attention.

30% System & Rules
35% Retrieved Context
20% Conv. History
15% Output Reserve
Agent Archetype System + Rules Retrieved Context History Output Reserve Recommended Total Budget
Coordinator / Orchestrator 2,000 1,500 1,500 1,000 6,000
Executor (standard task) 2,000 3,000 1,000 1,000 7,000
Critic / Verifier 1,500 4,000 500 500 6,500
Synthesizer (multi-source) 3,000 20,000 2,000 5,000 30,000
Long-context analysis agent 4,000 150,000 2,000 4,000 160,000 (Opus / Gemini 1.5)
Triage / routing agent 1,000 500 500 250 2,250

Context Assembly Pipeline

Before each LLM call, the context assembly pipeline runs in this order:

  1. Query analysis: Extract intent, entities, and required knowledge domains from the incoming request.
  2. Parallel retrieval: Simultaneously execute sparse keyword search (BM25) and dense embedding search over the knowledge base. Run graph queries for entity relationships where applicable.
  3. Re-ranking: Apply Reciprocal Rank Fusion (RRF) to merge sparse and dense results. Apply a domain-specific re-ranker model if available. Select the top-K chunks by relevance score.
  4. History compression: If conversation history exceeds the allocated slot, run a lightweight summarization pass (using a low-cost model) to condense older turns.
  5. Slot assembly: Fill each slot in priority order: rules → retrieved context → compressed history → current request. Enforce byte-level token counts per slot.
  6. Overflow handling: If assembled context still exceeds the model’s window after compression, drop lowest-scoring retrieved chunks first, then older history. If after two rounds the window is still exceeded, reject the request and escalate — never silently truncate.

Retrieval Quality

[ASSUMPTION] Hybrid BM25 + dense vector retrieval with RRF fusion achieves significantly higher recall than either method alone on domain-specific corpora. Retrieval quality is highly domain-dependent — a legal corpus, a software engineering knowledge base, and a general Q&A corpus behave very differently. Establish task-specific retrieval benchmarks (recall@K, precision@K, MRR) and acceptance criteria before deployment rather than relying on universal recall targets.
  • Embedding model selection: Choose an embedding model trained on content similar to the target domain. General-purpose embeddings perform well for broad tasks; domain-specific fine-tuned embeddings improve recall on specialized corpora.
  • Chunk sizing: Chunks of 256–512 tokens generally outperform longer chunks for retrieval precision. Long documents should be chunked with sentence-boundary awareness and overlapping windows to avoid splitting reasoning context.
  • Query expansion: A lightweight pre-retrieval step generates 2–3 query variants (paraphrases, related terms) and retrieves against all variants. Dramatically improves recall for ambiguous or terse user queries.
  • Contextual compression: After retrieval, a compression pass extracts only the sentences from each chunk that are directly relevant to the query, reducing noise in the context window without losing key information.

Long-Running Agent Context Management

Agents that operate over extended periods (hours to days) face a unique challenge: context accumulated across thousands of turns cannot fit in any context window. Frameworks like OpenClaw address this with a Memory Dreaming pattern — during idle periods, a background consolidator model distills short-term conversation history into structured long-term memory entries, which are then retrievable via semantic search. This pattern is recommended for any agent designed to maintain coherence across sessions.

A complementary approach, used by Claude Code and similar developer agents, is a virtualized filesystem: the agent externalizes working state to structured artifacts (plan files, task lists, decision logs) that persist independently of the context window and are selectively loaded into context as needed.

6. Agent Harness Design

The agent harness is the execution layer that wraps every LLM inference call. It is responsible for context assembly, tool execution, output validation, retry logic, memory persistence, and observability instrumentation. In 2026, industry consensus has converged on a key principle: the harness is the binding constraint on agent reliability, not the underlying model.

Core Harness Principle The LLM never directly executes tools. The model returns a structured tool call specification; the harness validates the schema, checks authorization, executes the tool in an isolated environment, and injects the result back into the next context window. This single pattern eliminates entire categories of security vulnerabilities and makes agent behavior auditable by construction.

Harness Responsibilities

Responsibility Description
Context Assembly Orchestrates the retrieval, compression, and slot-filling pipeline described in Section 5. Enforces token budgets before the LLM call is made.
Model Routing Selects the appropriate model tier based on task complexity, context size, authority requirements, and cost budget. Manages fallback to secondary models when primary is unavailable.
Tool Authorization Validates that the requested tool call is in the agent’s declared allowlist. Checks parameter schemas. Rejects unauthorized tool calls before execution.
Tool Execution Sandbox Executes tools in an isolated environment (container, WebAssembly module, or subprocess). Enforces timeouts, memory limits, and network egress restrictions. Tool results are sanitized before injection into the next context.
Output Validation Validates model outputs against declared schemas and constraint rules. Runs PII detection. Computes confidence signals. Triggers re-generation or escalation on validation failure.
Retry & Recovery Manages transient failure recovery: exponential backoff on API errors, fallback to secondary models on provider failure, context sanitization and restart on persistent validation failures.
Memory Persistence Writes decision records, reasoning traces, and state updates to the appropriate memory tier before returning the response.
Observability Instrumentation Emits structured spans for every harness operation: model call, tool execution, retrieval, validation. Tracks token counts, latency, cost, and confidence at every step.

The Agent Reasoning Loop

The harness implements the agent’s core reasoning loop — a multi-turn cycle that runs until the model produces a final text-only response (no further tool calls):

  1. Assemble context from memory, retrieval pipeline, and current conversation state.
  2. Call the LLM with the assembled context.
  3. If the model returns a tool call: validate authorization → execute in sandbox → inject result → return to step 1.
  4. If the model returns a delegation request: emit message to target agent queue → await response (async) or block (sync, with timeout) → inject result → return to step 1.
  5. If the model returns a final response: run output validation → check confidence → persist to memory → deliver or escalate.
  6. If any step produces an unrecoverable error after retries: emit escalation event → persist failure record → return error response.

Bounded, Deterministic Workflows

Production experience across the industry has strongly validated the Supervisor Pattern over unconstrained agent swarms. Key principles:

  • Phase-gating: Complex workflows are divided into explicit phases (plan → gather → draft → review → deliver). No phase begins until the previous phase’s output has been validated. This prevents cascading errors from propagating through the entire workflow.
  • Bounded loops: Every reasoning loop has a hard maximum iteration count enforced by the harness, not the prompt. Unconstrained loops are the most common cause of runaway cost and infinite regress.
  • Explicit termination conditions: The harness defines success and failure criteria independently of the model. A task completes when the harness validates the output satisfies the success criteria — not when the model claims it is done.
  • Prefer simpler graphs: Resist the temptation to model every possible execution path. A flat, sequential pipeline with explicit branching at known decision points is more reliable and easier to debug than a fully dynamic agent graph.

Model Routing

The harness selects the most cost-efficient model that meets the task’s requirements at runtime:

Condition Model Tier Rationale
Triage, routing, simple lookup, PII scanning Low tier (Haiku class) Lowest cost; fast; more than sufficient for classification and simple generation tasks
Standard analysis, structured output, multi-step reasoning Mid tier (Sonnet class) Best accuracy/cost balance; suitable for the majority of agent tasks
Complex synthesis, long-document analysis, high-stakes decisions High tier (Opus / GPT-4 class) Highest reasoning quality; use only where lower tiers demonstrably fail
High-volume batch workloads, cost-sensitive, non-real-time Self-hosted open-source (Llama class via vLLM) Near-zero per-token cost after GPU amortization; data does not leave the cluster

Tool Integration via MCP (Model Context Protocol)

The Model Context Protocol (MCP) has become the standard abstraction for connecting agents to external tools, data sources, and capabilities. Rather than each agent implementing bespoke tool integrations, MCP-compatible servers expose capabilities through a standardized interface that any MCP-aware harness can discover and invoke. This approach is now adopted by Anthropic’s Claude, OpenAI, Hermes Agent, Nanobots, and major enterprise platforms.

  • MCP servers: Each tool category (database access, file system, web search, API calls, code execution) is exposed as a separate MCP server with a declared capability manifest. Agents discover available tools from the manifest at startup — no hardcoded integrations in the harness.
  • MCP governance: Server manifests declare required permissions and data access scopes. The harness validates that the requesting agent’s allowlist includes the specific MCP server and capability before establishing a connection. Unsigned or unregistered MCP servers are rejected at the harness layer.
  • MCP security model: Each MCP server runs in its own sandboxed process with declared network egress, filesystem access, and API scopes — enforced by the container runtime, not by the server’s own code. Tool results returned from MCP servers are sanitized (size limits, executable content stripped) before injection into agent context. A compromised MCP server cannot affect the harness process.
  • MCP in production on GKE: Host each MCP server as a separate GKE Deployment with its own service account, resource limits, and network policy. This enables independent scaling, versioning, rollback, and security policy per tool category — far more operationally manageable than bundled tool integrations.

7. Memory Architecture

Agent memory is organized into three tiers by temporal scope, with distinct storage backends and access patterns for each. A well-designed memory system lets agents maintain coherence across long sessions and leverage accumulated knowledge without ever overflowing their context window.

Memory Tiers

Tier Scope Storage TTL / Retention What Is Stored
Short-Term (Working) Current session / task In-process state + Redis Session lifetime Active conversation turns, intermediate reasoning steps, tool results from this session
Episodic (Mid-Term) Recent interactions Redis (hot) + Vector DB (warm) Days to weeks Summarized past sessions, recent decisions with outcomes, user preferences and corrections
Semantic (Long-Term) Persistent knowledge Vector DB + Knowledge Graph Indefinite (versioned) Domain knowledge base, policy documents, curated few-shot examples, entity relationships

Short-Term Memory Management

The default working context retains the most recent N turns, where N is determined by the agent’s allocated history slot in the token budget. When the history slot fills to 70% of capacity, the harness automatically triggers a summarization pass using a low-cost model, condensing older turns into a compact summary object that replaces them in context. This keeps the context window from filling with stale conversational scaffolding while preserving the essential reasoning thread.

Long-Term Memory & Memory Dreaming

Inspired by OpenClaw’s Memory Dreaming architecture, long-term memory consolidation runs as a background process during idle periods. A lightweight consolidator model reviews the agent’s recent short-term memory, extracts key facts, decisions, preferences, and outcomes, and writes structured entries to the long-term vector store. This mimics biological memory consolidation during sleep and enables agents operating over days or weeks to maintain coherence without ever exceeding their context window.

All consequential decisions — task completions, escalations, tool call outcomes, user approvals — are persisted as immutable records to an append-only decision log before the response is returned.

Storage Architecture

Store Technology Purpose
Hot cache Redis 7 Active session state, embedding cache, rate limit counters. LRU eviction keeps the most-used context in memory.
Vector store pgvector (PostgreSQL extension) Semantic search over knowledge base and episodic memory. HNSW index for ANN search. Zero new infrastructure if PostgreSQL is already in the stack.
Knowledge graph (Optional) Neo4j Entity relationships, multi-hop reasoning across connected data. Add only when entity relationships are a proven core requirement and vector search alone demonstrably fails to capture them. Carries real operational cost: separate infrastructure, backup strategy, consistency model, and expertise. Many production deployments never require this layer — start without it and add only when justified by evidence.
Relational store PostgreSQL Audit log, decision history, agent registry, model registry. Append-only tables with hard deletes disabled at the database layer.
Cold archive Object storage (GCS) Raw reasoning traces, long-term data retention beyond warm-tier TTL.

Knowledge Base Governance

Knowledge base entries are version-controlled with effective-from and effective-until markers. Superseded entries are retired (excluded from retrieval) but never deleted — they remain accessible for audit and historical context. All writes to the knowledge base require an authenticated, audited write path; agents have read-only access. Knowledge base entries are stored with content integrity hashes; retrieval re-verifies hashes before injecting content into agent context.

8. Multi-Agent Orchestration

Recommended Pattern: Hierarchical Supervisor

For the majority of enterprise agent deployments, the hierarchical supervisor pattern is the right choice. A Coordinator agent receives all inbound requests, classifies intent, and routes to the appropriate specialist agents. Specialists may sub-delegate once (at most twice) before returning results to the Coordinator. This produces a tree-shaped execution graph that is auditable, debuggable, and controllable.

Why Not Peer-to-Peer? Peer-to-peer agent graphs create non-deterministic execution paths that are difficult to audit and even harder to debug when something goes wrong. The supervisor pattern’s single authoritative root is a feature, not a limitation — it is what makes the system explainable to operators and reviewers.

Workflow Engine

Complex multi-step agent workflows (involving tool calls, human-in-loop waits, retries, and conditional branching) must be backed by a durable workflow engine — not simple async code. The workflow engine guarantees that workflows survive process crashes, network failures, and model errors by replaying from the last successful checkpoint. Apache Temporal is the recommended engine for GKE deployments; it supports:

  • Durable, retryable activities with exponential backoff.
  • Long-running human-approval activities that pause indefinitely until a human approval signal is received.
  • Full execution history as an immutable audit trail — Temporal’s event history is the authoritative record of what each agent did and when.
  • Deterministic workflow replay for debugging and investigation.

Communication Protocol

  • Asynchronous (Kafka): Default for high-volume internal agent-to-agent messaging within the cluster. A dedicated topic per agent archetype, with consumer group scaling. Messages carry the full delegation envelope (Section 3). Dead-letter queues capture messages that fail processing after the maximum retry count.
  • Synchronous (gRPC / HTTP): Used only when the Coordinator must block on a sub-agent result before proceeding — e.g., a fast lookup on the critical path. Hard timeout enforced; failure routes to the async fallback path.
  • A2A Protocol: The interoperability layer for agent-to-agent communication across framework and vendor boundaries. See below.

Agent-to-Agent Interoperability: A2A Protocol

The Agent-to-Agent (A2A) Protocol is the Linux Foundation open standard for inter-agent communication. Originally developed by Google and donated to LF AI & Data, A2A v1.0 was released in early 2026 with over 150 supporting organizations: Google, Microsoft, AWS, Oracle, Databricks, Snowflake, Salesforce, ServiceNow, SAP, and IBM. It is natively integrated into Google ADK, Microsoft Agent Framework, LangGraph, Semantic Kernel, CrewAI, Amazon Bedrock AgentCore, and Azure AI Foundry. Financial services is a confirmed production adoption vertical.

A2A and MCP have distinct, complementary roles — one is not a replacement for the other:

Protocol Connects What It Handles
MCP (Model Context Protocol) Agent → Tool / Data Source How an agent calls a database, API, file system, or code execution environment. Defined in Section 6.
A2A (Agent-to-Agent Protocol) Agent → Agent How a Coordinator delegates to a specialist agent; how agents built on different frameworks discover and interoperate with each other.

Key A2A production properties relevant to this architecture:

  • Signed Agent Cards: Each agent publishes a cryptographically signed capability manifest (Agent Card) declaring its skills, required permissions, and communication endpoint. Agents verify cards before accepting delegated tasks — preventing spoofing and capability misrepresentation.
  • Cross-framework interoperability: A Coordinator built on Google ADK can delegate to a specialist agent built on LangGraph over A2A without bespoke integration code. This is significant for organizations that run multiple teams with different framework preferences, and for evolving the architecture incrementally.
  • Enterprise security: A2A v1.0 includes authenticated agent discovery, enterprise-grade multi-tenancy, and VPC-compatible deployment — required properties for banking infrastructure.
  • Kafka as A2A transport: A2A defines the protocol layer (message envelope, capability discovery, authentication). For high-volume internal messaging within the GKE cluster, Kafka remains the recommended transport implementation, carrying A2A-formatted messages.

Conflict Resolution

When two agents produce contradictory outputs on the same question, the Coordinator runs a synthesis step: both reasoning traces are presented to a high-tier model with the explicit instruction to identify the root cause of disagreement. If the disagreement is resolvable by evidence quality (one agent has higher-confidence retrieval), the higher-confidence output wins. If it is genuinely unresolvable, the decision is escalated to a human, and both outputs plus the synthesis analysis are presented together. Human resolutions are logged and fed back into the labeled training dataset.

State Consistency

Shared state (task status, accumulated results, entity state) lives exclusively in the authoritative data store. Agents do not cache shared state locally between decisions — they query the source of truth on each reasoning step. Optimistic concurrency control prevents race conditions when multiple agents act on the same entity simultaneously. The workflow engine checkpoints task progress so that any agent can be restarted and resume from the last known-good state.

9. Hallucination & Reliability Controls

Grounding Requirements

Any agent output that makes a factual claim must cite the source identifiers from the knowledge base that support that claim. The output validation layer rejects outputs that make factual assertions without citations before they are delivered. Agents are instructed in the rules block to use explicitly hedged language for all claims (“evidence suggests,” “based on retrieved data”) rather than asserting facts as certainties.

Confidence Scoring

Do Not Use LLM Self-Reported Confidence Multiple studies confirm LLM self-assessed confidence is poorly calibrated and nearly useless as a reliability signal. Research found GPT-4 assigned its highest confidence rating to 87% of responses — including factually incorrect ones. Models exhibit minimal variation in self-reported confidence between right and wrong answers. Base confidence scoring exclusively on observable, objective signals.

A composite confidence score is computed from objective, externally verifiable signals. Any signal below its individual threshold contributes to a lower composite score; the composite falling below 0.75 triggers escalation:

  • Citation coverage: Fraction of factual claims in the output supported by a retrieved knowledge base source identifier. Uncited claims directly lower this signal. This is the highest-weight signal.
  • Retrieval quality: Mean similarity score of the top retrieved chunks against the query embedding. Low scores indicate the knowledge base may not cover the requested topic — the agent is reasoning from poor grounding.
  • Validation pass rate: Whether the output passed the schema validator, constraint checker, and PII scanner on the first generation attempt. Re-generation required → lower score.
  • Critic agreement: For decisions reviewed by a Critic agent, the degree of agreement between the Critic’s assessment and the Executor’s output. Disagreement lowers the score proportionally to its severity.
  • Historical task accuracy: Rolling accuracy of this agent on the same task type over the past N runs, derived from the audit log and labeled decision corpus — not from the model’s own self-assessment.

When the composite score falls below the configured threshold (default: 0.75), the decision is automatically flagged for human escalation. Threshold values must be calibrated per task type against the labeled decision corpus; a single global threshold is insufficient for systems handling diverse task categories.

Fact Verification Pipeline

Hallucination Detection Is Partial This pipeline detects a subset of hallucinations: unsupported claims and claims contradicted by the knowledge base. It cannot detect subtle factual errors, plausible but wrong inferences, or hallucinations about topics outside the knowledge base. No current automated approach can reliably detect all hallucinations. Human review of high-stakes outputs remains the only comprehensive mitigation — do not imply otherwise in operator-facing documentation.
  1. A lightweight extraction pass (using a low-cost model) identifies all factual claims in the agent’s output.
  2. Each claim is checked against the knowledge base for supporting or contradicting evidence.
  3. Claims with no supporting evidence: confidence is downgraded; a disclaimer is added; the output is flagged for review.
  4. Claims directly contradicted by the knowledge base: output is rejected; the incident is logged as a hallucination event; the agent retries with additional grounding context injected.

Self-Correction Loop

After initial generation, a second verification pass checks the output for internal consistency: does the conclusion follow from the cited evidence? Do claims contradict each other? Detected inconsistencies trigger a regeneration request with the specific issues highlighted as additional context. After two failed regeneration attempts, the request is escalated — it is never silently delivered in an inconsistent state.

Human-in-Loop Triggers

Hard Rules — Enforced by Harness, Not Prompt The following conditions always route to human review before any action is taken. These triggers are implemented in the harness and cannot be disabled by prompt changes or model outputs.
  • Blended confidence score below the configured threshold
  • Any factual claim with no supporting knowledge base citation
  • Any irreversible action (deletion, external message send, state mutation on critical data)
  • Agent output flagged by the consistency verification pass
  • Three consecutive decisions by the same agent with declining confidence (potential drift signal)
  • Any decision in a use-case category explicitly marked “human-in-loop required” in the agent configuration

10. Security & Governance

OWASP Top 10 for LLM Applications (2025)

Risk Description Mitigations
LLM01: Prompt Injection Adversarial inputs in user data or retrieved content override model behavior Delimiter isolation between instruction and data channels; jailbreak embedding detection on all user input; post-generation constraint validation; user input is data, never instructions
LLM02: Insecure Output Handling Unsafe agent output is executed downstream (XSS, code injection, SQL injection) All agent outputs are structured data validated against a schema; no dynamic code generation without sandbox execution; downstream systems treat agent output as untrusted input
LLM03: Training Data Poisoning Compromised training or fine-tuning data degrades model behavior Fine-tuning datasets curated with human review and full provenance tracking; cryptographic hash verification of training data; prefer managed model APIs over self-trained base models in production
LLM04: Model Denial of Service Excessively long or complex inputs exhaust inference budget Token budget enforcement before API calls; per-tenant and per-agent rate limits; message queue concurrency limits prevent spike amplification
LLM05: Supply Chain Vulnerabilities Compromised LLM provider, dependency, or plugin Provider SOC 2 and security certifications verified; all dependencies pinned with automated vulnerability scanning; skills/plugins require signed manifests before installation (OpenClaw ClawHub model)
LLM06: Sensitive Information Disclosure Model leaks PII, credentials, or confidential training data PII tokenization before data reaches any LLM API; output PII scanner redacts any detected sensitive data before delivery; system prompts never contain real user data
LLM07: Insecure Plugin Design Agent tools expose capabilities beyond their intended scope Explicit tool allowlist per agent archetype enforced by harness; all tool calls logged and audited; no shell execution tools in production; tool parameters validated against schema before execution
LLM08: Excessive Agency Agent takes actions beyond its intended scope or authority Authority levels enforced at the harness layer (not prompt); irreversible actions require explicit human approval; agent service accounts have minimal database and API grants
LLM09: Overreliance Operators or end users blindly trust agent outputs without verification Confidence scores and reasoning traces are always surfaced to human reviewers; UI framing emphasizes “AI-assisted, human-decided”; weekly accuracy reports to governance committee
LLM10: Model Theft System prompts, fine-tuned weights, or proprietary knowledge exfiltrated System prompts stored encrypted at rest and never exposed in API responses; RBAC on prompt registry; network egress controls on agent pods (no direct internet access without allowlist)

Model Governance

  • Model Registry: All approved model versions are tracked in a governed registry with approval timestamps, approver identity, and declared use-case scope. Agents may only invoke models listed in the registry for their specific use case.
  • Rollout Strategy: New model versions follow a canary deployment: 5% traffic → 25% → 100%, with automatic rollback triggered if accuracy or confidence metrics regress more than 5% from the baseline at any stage.
  • Prompt Change Control: System prompt changes require a Git pull request reviewed by the AI platform team. Merged prompt versions are tagged, immutable, and must pass the full regression suite before production promotion.
  • Fine-tuning Governance: Training datasets for any fine-tuning run are reviewed for quality, bias, and provenance before use. Fine-tuned models must complete the same canary rollout as any model change.

Data Isolation & PII

  • Each agent runs with its own service identity; database access is scoped to the minimum required grants for that agent’s declared tool set. Row-level security enforces tenant scoping on every query.
  • PII (names, contact details, credentials, any regulated personal data) is tokenized at system ingestion before any agent ever sees it. Detokenization occurs only in the delivery layer, outside the agent reasoning path.
  • All agent outputs pass through a PII scanner before delivery or logging. Any detected PII is redacted and an alert is emitted.

Audit Trail

Every agent decision produces an immutable audit record persisted to an append-only log before the response is returned. This record includes the agent’s identity, model version, a hash of the prompt used, identifiers of retrieved context chunks, the model’s full output, all tool calls made, validation results, confidence score, and delivery timestamp. Hard deletes on this log are blocked at the database layer.

11. Observability & Monitoring

Agent Health Metrics

Metric Category What Is Measured Alert Threshold
Decision latency P50/P95/P99 end-to-end per agent archetype and model. Realistic P95 targets vary significantly: single agent + tools (3–5s), multi-step agent workflow with retrieval (15–60s), research or synthesis workflow (1–5 min). Set per-agent budgets at design time — a single 3s target across all archetypes is not achievable for multi-agent chains and will produce constant false alerts. P95 > per-agent configured budget × 1.5 → warn; > 3× → page
Error rate Failed, rejected, or timed-out decisions as a fraction of total >2% → warn; >5% → page
Token consumption Tokens used per agent, per slot type, per model Daily budget >80% → warn; >100% → block new requests
Inference cost Cumulative spend per agent, per tenant, per model Monthly burn rate >80% of budget → warn
Confidence score Rolling mean blended confidence per agent archetype 1-hour mean <0.75 → warn; <0.65 → page
Escalation rate Fraction of decisions routed to human review Spike >2× rolling 7-day baseline → investigate
Hallucination events Decisions where the fact-verification pipeline detected a false claim Any occurrence → immediate alert to AI governance team
Cache hit rate Fraction of context assembly requests served from prompt/embedding cache <50% hit rate on stable knowledge base → investigate retrieval configuration

Reasoning Traces & Audit Records

Do Not Log Raw Chain-of-Thought in General Logs Raw CoT traces carry two documented risks: (1) they may contain sensitive data, system prompt internals, or credentials that should not appear in general-purpose logs; (2) recent research shows CoT traces may not faithfully represent actual model reasoning and can be adversarially manipulated — making them an unreliable audit artifact. Store structured decision metadata as the canonical record, not raw reasoning text.

The authoritative audit record for each agent decision is a structured decision artifact — not the raw reasoning trace — persisted to the append-only audit log before the response is returned:

  • Agent identity + model version + prompt hash: Which agent, which model, which exact prompt version governed this decision
  • Retrieved context identifiers + similarity scores: Which knowledge base sources the agent was grounded on, with retrieval quality signal
  • Tool calls and results: Every tool invocation with declared parameters and structured output (never raw execution logs)
  • Final output: The actual response delivered to the caller
  • Validation results: Schema check, citation coverage score, PII scan result, composite confidence score, Critic assessment if applicable
  • Workflow state snapshot: Phase, iteration count, delegation depth, Temporal workflow ID

Raw reasoning traces, when stored for debugging or model improvement purposes, must be written to a separate, encrypted, restricted-access log path. Access requires elevated authorization and is audited. Raw traces must never appear in general application logs, never be returned via API responses, and never be used as the sole basis for audit or compliance evidence.

Model Drift Detection

  • Decision accuracy drift: Weekly automated regression against the labeled decision corpus. Alert if accuracy drops more than 5% from the rolling 30-day baseline.
  • Semantic drift: Monthly embedding similarity analysis comparing the centroid of agent output embeddings this month against the 3-month rolling average. Significant divergence triggers a governance review.
  • Confidence drift: Rolling 7-day mean confidence per agent monitored continuously. A sustained decline from baseline triggers investigation into potential model degradation or knowledge base staleness.

Self-Healing & Auto-Remediation

  • Circuit Breakers: Per-LLM-provider circuit breaker. On consecutive timeouts, the circuit opens and traffic routes to the configured fallback model. Circuit probes every 30 seconds; closes after three consecutive successes.
  • Failed Agent Recovery: After three consecutive validation failures, the agent is automatically restarted with a sanitized (empty) context. If failures continue post-restart, the agent is isolated and an incident is created.
  • Context Corruption Detection: Knowledge base entries retrieved with a hash mismatch are quarantined, excluded from future retrieval, and flagged for manual review. The agent proceeds with reduced context rather than poisoned context.
  • Resource Exhaustion: Vector index size and cache memory are monitored continuously. Alerts fire at 70% capacity to allow proactive scaling before exhaustion.

Tooling Stack

  • Metrics: Prometheus + Grafana. Custom dashboards for per-agent cost tracking and accuracy burn-down.
  • Logs & Traces: Structured JSON to Loki; distributed tracing to Tempo via OpenTelemetry. All LLM spans annotated with model identity, token counts, cost, and confidence.
  • LLM-specific monitoring: Arize AI or equivalent (hallucination detection, embedding drift, decision accuracy tracking). Async export — never on the critical inference path.
  • Prompt management: Git-based registry; Grafana panels show per-prompt-version accuracy and cost trends to guide promotion decisions.

12. Framework Landscape

For a regulated enterprise (large financial institution) deploying on GKE, framework selection must satisfy additional criteria beyond technical capability: active vendor support, enterprise security posture, longevity guarantees, and regulatory acceptability. The frameworks below are divided into two tiers accordingly. Communication protocols (MCP for tool access, A2A for agent-to-agent — covered in Sections 6 and 8) are framework-agnostic and should be adopted regardless of which framework tier you choose.

AutoGen Is in Maintenance Mode Microsoft moved AutoGen to maintenance-only status in early 2026 (security patches and critical bug fixes only; no new features). Do not start new enterprise projects on AutoGen. The official successor is Microsoft Agent Framework (MAF), which reached GA on April 2, 2026. Teams with existing AutoGen deployments should plan migration to MAF.

Enterprise-Grade Frameworks

Actively maintained, production-proven, with enterprise support and security commitments. Acceptable for regulated financial services deployments.

Google ADK
Google · Open-source · Python / TypeScript / Go / Java · Apache 2.0
Deterministic Graph Workflows · GKE-Native · A2A

Google’s enterprise agent framework, GA across four languages in 2026. Workflows are defined as typed directed graphs (nodes = agents or steps; edges = transitions), providing deterministic, auditable execution. Native GKE deployment with official tutorials, Vertex AI integration, Cloud Trace observability, and VPC Service Controls out of the box. Built-in evaluation framework (evalsets, LLM-as-judge). A2A protocol support is first-class. One-command deploy to Vertex AI Agent Runtime, Cloud Run, or GKE Autopilot. The natural choice for GKE-native architectures on Google Cloud.

Microsoft Agent Framework (MAF)
Microsoft · Open-source · Python / .NET · GA April 2026
Graph Orchestration · A2A + MCP · Hosted Agents

The production-grade successor that merges AutoGen’s orchestration capabilities with Semantic Kernel’s enterprise plugin system. GA April 2, 2026. Key capabilities: graph-based multi-agent orchestration, persistent session management, OpenTelemetry-native observability, MCP and A2A protocol support. Hosted Agents on Azure AI Foundry provide sub-100ms cold starts, zero idle cost, built-in identity, and automatic scaling. Supports Python and .NET. For organizations on the Microsoft / Azure stack or with existing Semantic Kernel investments, MAF is the forward path.

LangGraph
LangChain · Python · Production-hardened
Stateful Graph Workflows · A2A

The most widely deployed production orchestration framework in the LangChain ecosystem. Workflows are modeled as stateful graphs where typed state flows between nodes, branches conditionally, and loops based on intermediate results. State checkpointing enables durable, resumable workflows. Native A2A support for agent-to-agent interoperability. LangSmith provides production observability: traces, evals, and prompt versioning. Google Cloud’s Agent Runtime can deploy LangGraph agents directly as A2A endpoints. Best for complex stateful workflows where explicit state management, conditional branching, and auditability are first-class requirements.

CrewAI
CrewAI Inc. · Open-source · Python · Enterprise Edition available
Role-Based Team Coordination · A2A

Enables role-based multi-agent systems defined in natural language. Agents are assigned roles, goals, and backstories; a Crew coordinates their work. The open-source edition is best for rapid prototyping. CrewAI Enterprise adds human-in-loop, persistent memory, SSO, and SLA-backed support — required for regulated production deployments. Native A2A support. Note: CrewAI Inc. is a VC-funded startup; evaluate vendor stability and enterprise support terms before committing to a banking production deployment. Prefer prototyping with CrewAI and graduating to ADK or LangGraph for production if vendor stability is a concern.

Nanobots (Internal)
Internal Runtime · Python + YAML skill packs
HTTP Wrapper + MCP Skills + LLM Router

The internal agent runtime provides a multi-tenant HTTP wrapper, an MCP-based skill system, a pluggable LLM router, and composable YAML-defined skill packs loaded dynamically per tenant. The LLM router selects model tier per request based on configured rules, enabling centralized cost control and model governance across tenants. Designed for the internal platform where multi-tenancy, skill composability, and cost control are primary requirements. Reference implementation: apps/agent_runtime/. Can integrate with LangGraph or ADK for complex workflow orchestration needs.

Design Pattern References

Not for Direct Enterprise Adoption The frameworks below are architecturally interesting but do not meet the maturity, support, or regulatory acceptability bar for a large financial institution. Study their design patterns; do not take a production dependency on them. Where their patterns are valuable, implement the pattern using an enterprise-grade framework above.
Framework Valuable Pattern to Study Why Not for Direct Enterprise Adoption
OpenClaw (Peter Steinberger) Memory Dreaming: background consolidation of short-term memory into long-term vector storage during idle periods. Valuable pattern for cross-session agent coherence. Personal assistant focus; no enterprise support SLA; not hardened for banking compliance or security audit requirements.
DeerFlow (ByteDance) Hierarchical research orchestration: primary orchestrator + parallel sub-agents + synthesis pattern for multi-source research workflows. ByteDance vendor origin raises regulatory and data-sovereignty concerns for regulated Western financial institutions. Adopt the orchestration pattern using LangGraph or ADK instead.
Hermes Agent (Nous Research) Persistent memory + RL training from trajectories: running agent as a training-data generator, producing tool-calling trajectories for fine-tuning. Released February 2026; no enterprise support; not production-hardened. The fine-tuning trajectory pattern is worth studying; the framework itself is not enterprise-ready.
MetaGPT (DeepWisdom) Role-based document-centric workflows: structured artifact production (PRDs → architecture → code → tests) through role-assigned agents. DeepWisdom vendor origin; niche software engineering focus; not general enterprise use. The artifact-chaining pattern translates well to LangGraph or ADK implementations.

Framework Selection Guidance

Priority / Context Recommended Enterprise Framework
GKE-native deployment, Google Cloud stack, multi-language teams Google ADK — native GKE integration, Vertex AI, Cloud Trace, A2A first-class
Microsoft / Azure stack, .NET teams, or migrating from AutoGen / Semantic Kernel Microsoft Agent Framework (MAF) — the supported successor; do not start new projects on AutoGen
Complex stateful workflows, explicit conditional branching, LangChain ecosystem LangGraph — most widely deployed, strong A2A + observability, deployable on ADK Agent Runtime
Rapid prototyping, role-based agent teams CrewAI (Enterprise edition) — validate vendor stability before production commitment
Cross-framework agent interoperability (any combination of the above) A2A Protocol — framework-agnostic; enables ADK agents to delegate to LangGraph agents and vice versa

13. Testing Strategy

Principle LLM agent testing differs fundamentally from deterministic software testing. Tests must tolerate output variation while asserting on structural properties, constraint adherence, and behavioral invariants — not exact string matches. The test suite is a first-class engineering artifact maintained with the same rigor as the agent code itself.

Unit Testing

  • Prompt constraint adherence: Assert that system prompts with adversarial user inputs do not produce outputs that violate declared rules. Use a lightweight mock LLM that returns scripted responses for specific input patterns — not a live API call.
  • Context assembly: Assert correct token counts, slot allocation, overflow handling, and summarization triggers for known input sizes. These are deterministic tests that do not require LLM calls.
  • Retrieval pipeline: Unit tests for embedding search, BM25 indexing, RRF fusion, and re-ranking logic with fixed fixtures.
  • Output validation: Tests for the schema validator, PII scanner, confidence threshold logic, and fact-verification pipeline in isolation.
  • Harness authorization: Assert that tool calls not in the agent’s allowlist are rejected before reaching the execution sandbox.

Integration Testing

  • End-to-end workflow: Full agent decision flow against real backing services (vector DB, Redis, knowledge base) with a mock LLM via a local OpenAI-compatible server.
  • Multi-agent coordination: Test the full delegation chain (Coordinator → Executor → Critic) with mock LLM. Assert correct message routing, delegation depth enforcement, and escalation on conflict.
  • Workflow durability: Kill a worker mid-workflow; verify Temporal replays and resumes from the last checkpoint without data loss.
  • PII non-leakage: Inject synthetic PII into test inputs; assert it is absent from all message queue events, log entries, and agent outputs.

Hallucination Testing

  • Out-of-knowledge-base queries: Present agents with questions whose answers are not in the knowledge base. Assert output contains an explicit uncertainty signal, not a fabricated answer.
  • Adversarial prompt library: 50+ curated jailbreak and injection attempts. Assert all are detected and rejected before reaching the LLM API.
  • Consistency stress tests: Feed contradictory premises; assert the agent detects the contradiction and escalates rather than synthesizing a fabricated resolution.
  • Production regression: Every hallucination event in production generates a new test case. Target: 100% of known failure modes covered.

Performance Testing

  • Latency profiling: Measure P50/P95/P99 per agent archetype under 1×, 2×, 5×, and 10× normal load. Identify which component (inference API, retrieval, orchestration) limits throughput.
  • Token efficiency: Measure actual vs. budgeted token usage per slot. Any agent consistently using more than 90% of its budget needs configuration review.
  • Cache effectiveness: Measure embedding cache and prompt cache hit rates under realistic query distributions. Target >70% for stable knowledge base content.

Canary & A/B Testing

  • New prompt versions are deployed to 5% of traffic. Promotion criteria: accuracy ≥95% of control, latency within 110%, cost within 105%, measured over a minimum 48-hour window.
  • Model version changes: single-replica canary first. Workflow history provides the ground truth for comparison.
  • Retrieval strategy A/B: BM25-only vs. hybrid search — decision quality measured over 2-week windows before committing to a change.

Evaluation Framework

LLM agent evaluation operates at four distinct layers that require separate metrics and tooling. Conflating them produces misleading quality signals — a model that scores well on benchmarks can still produce a failing agent workflow, and a passing workflow can still miss business targets.

Layer What Is Evaluated Key Metrics Tooling
Model Evaluation Raw LLM capability independent of agent scaffolding — instruction following, structured output, reasoning quality Domain-specific benchmark accuracy; token-level calibration; latency; cost per token MLflow LLM Evaluate; provider eval APIs
Agent Evaluation End-to-end agent behavior: tool selection accuracy, retrieval quality, citation coverage, constraint adherence, output correctness per task type Task success rate; tool call accuracy; citation coverage; constraint adherence rate; escalation appropriateness rate MLflow Agent Evaluate; custom judges
Workflow Evaluation Multi-agent coordination quality: delegation accuracy, phase-gate outcomes, conflict resolution, durability under failures, audit trail completeness Workflow completion rate; phase failure rate; MTTR on failures; Temporal workflow success rate; delegation depth distribution Temporal workflow history; Grafana dashboards
Business Evaluation Actual impact on business outcomes: how often does the agent produce something a human approves, acts on, or has to correct? Human approval rate; correction rate; escalation rate vs. baseline; downstream process cycle time; cost per completed workflow including human review Product analytics; MLflow experiment tracking; Grafana
Human Review Cost Dominates In most enterprise agent deployments, the human review workflow — not model inference — is the dominant cost driver. A system with a 20% escalation rate where each human review takes 10 minutes can easily cost more than its model inference bill. Evaluate and optimize the human-in-loop cost explicitly; it belongs in the business evaluation layer above.

LLM-as-Judge

LLM-as-judge uses a separate LLM to evaluate agent output quality at scale, enabling evaluation that would be prohibitively expensive with human reviewers alone. MLflow ships built-in judges for grounding, correctness, safety, relevance, and custom guidelines. Key design rules for reliable judges in production:

  • Use a stronger or equal model as judge: A weaker judge cannot reliably evaluate a stronger model’s output. Use the same tier or higher.
  • Provide reference answers: Pointwise scoring against a known-good reference is significantly more reliable than comparative ranking between two outputs.
  • LLM judges have well-documented biases: Verbosity bias (preferring longer outputs), self-preference bias (preferring outputs resembling their own training style), and sensitivity to prompt template wording are confirmed in research. Use deterministic rule-based metrics alongside LLM judges — not instead of them.
  • Calibrate against human labels: MLflow supports human feedback loops where reviewer labels improve judge accuracy over time. Establish this calibration pipeline before relying on judges for production promotion decisions.
  • LLM judges cannot detect subtle factual hallucinations: Judges share the same knowledge limitations as the models they evaluate. They are reliable for assessing structure, format, relevance, and citation presence — not for verifying factual correctness in specialized domains.

MLflow Integration

MLflow is the recommended evaluation and experiment tracking platform (30M+ monthly downloads; top-rated agent eval platform in 2026). Key integrations for this architecture:

  • Log every agent run as an MLflow experiment with token counts, latency, confidence scores, tool calls, citation coverage, and validation results as tracked parameters.
  • Use MLflow’s LLM Evaluate API to run batch evaluations against the labeled decision corpus before promoting new prompt or model versions to production.
  • Compare prompt and model versions side-by-side in the MLflow comparison UI — accuracy, cost, latency, citation coverage — enabling data-driven promotion decisions.
  • Export human reviewer labels (approve/reject/correct) back into MLflow as ground-truth feedback, closing the loop between production outcomes and offline evaluation accuracy.
  • Use MLflow’s experiment tracking for A/B tests on retrieval strategies, prompt variants, and model tier assignments — the comparison must be evidence-based, not intuition-based.

Test Data & Fixtures

  • Labeled decision corpus: 500+ decisions with human-approved ground-truth outcomes, maintained in a dedicated regression test store and used for the weekly accuracy regression run.
  • Edge case library: Every production incident generates a synthetic test case. Retained indefinitely.
  • Synthetic stress data: Programmatically generated inputs covering boundary conditions, unusual-but-valid inputs, and known failure modes.

14. Deployment & Scaling (GKE)

GKE Node Pools

Pool Machine Type Purpose Min / Max Nodes
default n2-standard-4 Agent harness pods, Kafka consumers, Temporal workers, API gateway 3 / 20
gpu-inference a2-highgpu-1g (A100) vLLM self-hosted inference for open-source models (batch workloads) 0 / 4 (scale-to-zero)
memory-optimized n2-highmem-8 Vector index serving (pgvector), Neo4j knowledge graph, Redis 2 / 6

Auto-Scaling

Standard CPU/memory HPA is insufficient for LLM agent workloads whose resource usage is driven by model inference latency rather than compute. Custom metrics-based HPA is required:

  • Message queue depth per agent topic: Consumer pods scale when backlog exceeds a configured threshold, keeping delegation latency predictable under load spikes.
  • P95 decision latency: Additional replicas are provisioned when latency exceeds 80% of the per-agent budget, providing headroom before SLA breach.
  • GPU nodes: Scale-to-zero for batch inference pools; a ~3-minute cold start is acceptable for non-real-time workloads. Real-time agents must run on always-on compute.

Google ADK Deployment on GKE

For architectures built on Google ADK, Google provides an official GKE deployment path with first-party tooling. ADK agents deploy to GKE via the adk deploy command or the Vertex AI SDK, and on deployment automatically inherit: managed infrastructure, built-in VPC Service Controls, Cloud Trace integration, CMEK encryption, and Vertex AI Agent Runtime management. ADK agents deployed to GKE Autopilot can be exposed as A2A endpoints with automatic Agent Card serving and authenticated access — enabling cross-framework agent interoperability from day one. Refer to the official ADK GKE tutorial (linked in References) for the full deployment specification.

GitOps with ArgoCD

  • All agent Kubernetes manifests, prompt ConfigMaps, and knowledge base update jobs are managed via ArgoCD. Infrastructure state is always derivable from the Git repository.
  • Prompt updates create a new ConfigMap version; ArgoCD’s rolling update strategy delivers zero-downtime hot-reload without agent pod restarts.
  • Model version changes go through a separate ArgoCD Application with a manual sync gate requiring AI governance team approval before applying to production.

Deployment Strategies

  • Agent services: Blue-green deployment via ArgoCD Rollouts. New version is deployed alongside existing; traffic shifts only after health checks pass.
  • LLM model versions: Canary via a feature flag in the harness model router. The flag is controlled by a ConfigMap — no code deployment required for model version changes.
  • Knowledge base updates: Incremental batch jobs that append new entries and retire superseded ones. No downtime; retrieval continues against the stable existing index during the update.

15. Cost Optimization

Token Cost Levers (Highest Impact First)

Lever Mechanism Typical Saving
Model tier routing Route triage, classification, and simple generation tasks to low-tier models (Haiku class). Reserve mid/high tier for tasks where lower tiers demonstrably fail. 60–80% reduction on routed workloads
Prompt caching Provider-side prompt caching (Anthropic, OpenAI) reduces cost on the cached portion of the prompt (typically the stable system prompt + knowledge base context) by ~90%. 30–60% overall, depending on system prompt size and request volume
Context compression Contextual compression removes low-relevance sentences from retrieved chunks before injection. Reduces retrieved context tokens by 30–50% with minimal recall loss. 15–30% on retrieval-heavy agents
Embedding cache Cache query embeddings and retrieval results for semantically similar queries (cosine similarity above threshold). Eliminates redundant embedding API calls and database queries. 10–25% on high-query-volume agents
Self-hosted inference vLLM on GKE GPU nodes for high-volume, latency-tolerant workloads using open-source models. Near-zero per-token cost after GPU amortization. Near 100% API cost elimination for migrated workloads
[ASSUMPTION] Savings estimates are based on observed industry benchmarks and internal measurements. Actual savings depend on workload distribution, query patterns, and knowledge base stability. Cost models should be re-evaluated quarterly against current provider pricing.

Cost Monitoring & Governance

  • Per-agent, per-model, and per-tenant cost counters updated at inference time feed real-time Grafana dashboards.
  • Budget burn-rate alerts fire when projected monthly spend exceeds 80% of budget, leaving time to adjust routing rules before overrun.
  • Weekly cost efficiency reports compare actual token usage against budget per agent and flag agents where actual usage consistently exceeds 90% of budget.
  • Model routing rules are reviewed quarterly against current provider pricing and workload accuracy data to re-optimize the tier assignments.

16. Architecture Diagrams

System Architecture Overview

flowchart TB
    subgraph Clients["Client Layer"]
        UI["Web / Mobile UI"]
        API["API Consumers"]
        MSG["Messaging Channels\n(WhatsApp · Slack · Telegram)"]
    end

    subgraph GW["API Gateway (GKE)"]
        APIGW["Nginx / FastAPI Gateway\nRate Limiting · Auth · PII Tokenization"]
    end

    subgraph Orchestration["Orchestration Layer (GKE)"]
        COORD["Coordinator Agent\nIntent Classification · Routing"]
        TEMPORAL["Apache Temporal\nDurable Workflow Engine"]
    end

    subgraph Harness["Agent Harness (GKE)"]
        CTX["Context Assembly\nRetrieval · Compression · Slot Filling"]
        EXEC["Tool Execution Sandbox\nSchema Validation · Authorization"]
        VALID["Output Validation\nConstraint Check · PII Scan · Confidence"]
    end

    subgraph Agents["Specialist Agent Pool (GKE)"]
        PLANNER["Planner Agent\nHigh-tier model"]
        EXECUTOR["Executor Agents\nMid-tier model"]
        CRITIC["Critic / Verifier\nLow-to-mid-tier"]
        SYNTH["Synthesizer\nHigh-tier model"]
        MEMORY_CONS["Memory Consolidator\nLow-tier · Background"]
    end

    subgraph Memory["Memory and Knowledge (GKE)"]
        REDIS["Redis\nHot Cache · Sessions"]
        PG["PostgreSQL + pgvector\nVector Search · Audit Log"]
        NEO4J["Neo4j\nKnowledge Graph"]
        GCS["Object Storage\nCold Archive"]
    end

    subgraph LLM["LLM Providers"]
        ANTHROPIC["Anthropic API\nHaiku · Sonnet · Opus"]
        OPENAI["OpenAI API\nGPT-4o"]
        VLLM["vLLM (GKE GPU Pool)\nOpen-Source Models"]
    end

    subgraph Obs["Observability"]
        PROM["Prometheus + Grafana"]
        LOKI["Loki — Reasoning Traces"]
        ARIZE["Arize AI — Drift Detection"]
        TEMPO["Tempo — Distributed Tracing"]
    end

    KAFKA["Apache Kafka\nAgent Messaging · DLQ"]

    Clients --> GW --> COORD
    COORD <--> TEMPORAL
    COORD --> KAFKA --> Agents
    Agents --> Harness
    Harness --> Memory
    Harness --> LLM
    Agents --> Obs
    COORD --> Obs
        
Figure 1: Full system architecture — from client layer through agent harness, memory, LLM providers, and observability

Agent Decision Flow (Single Agent)

flowchart LR
    INPUT["Inbound Request"] --> SANITIZE

    subgraph Prep["Harness — Context Assembly"]
        SANITIZE["Input Sanitization\nPII tokenization · Injection detection"]
        RETRIEVE["Retrieval Pipeline\nBM25 + Dense embedding + RRF re-ranking"]
        COMPRESS["Contextual Compression\nRemove low-relevance sentences"]
        ASSEMBLE["Slot Assembly\nEnforce token budgets per slot"]
    end

    SANITIZE --> RETRIEVE --> COMPRESS --> ASSEMBLE

    subgraph Inference["LLM Inference"]
        CACHE["Cache Check\nPrompt cache + Embedding cache"]
        LLM["LLM API Call\nmodel selected by router"]
        CACHE -- "cache miss" --> LLM
        CACHE -- "cache hit" --> VALID
    end

    ASSEMBLE --> CACHE

    subgraph ToolLoop["Tool Execution Loop"]
        TOOL_AUTH["Tool Authorization\nAllowlist check · Schema validate"]
        TOOL_EXEC["Sandboxed Execution\nResult injected then return to LLM"]
    end

    LLM -- "tool call" --> TOOL_AUTH --> TOOL_EXEC --> LLM

    subgraph Validation["Output Validation"]
        VALID["Schema + Constraint Check"]
        FACT["Fact Verification\nClaims vs. knowledge base"]
        PII["PII Scanner"]
        CONSISTENCY["Consistency Verification\n2nd-pass lightweight model"]
        CONFIDENCE["Confidence Threshold"]
    end

    LLM -- "final response" --> VALID --> FACT --> PII --> CONSISTENCY --> CONFIDENCE

    CONFIDENCE -- "pass" --> AUDIT["Append-Only Audit Record"]
    CONFIDENCE -- "below threshold" --> HUMAN["Human Review Queue\nTemporal long-running activity"]
    AUDIT --> RESPONSE["Deliver Response"]
    HUMAN --> RESPONSE
        
Figure 2: Single agent decision flow — context assembly, tool loop, output validation, and human escalation

Multi-Agent Collaboration: Planner to Executors to Critic to Synthesizer

sequenceDiagram
    participant USER as User / Upstream System
    participant COORD as Coordinator
    participant PLANNER as Planner Agent
    participant KAFKA as Kafka
    participant EX1 as Executor A
    participant EX2 as Executor B
    participant CRITIC as Critic / Verifier
    participant SYNTH as Synthesizer
    participant TEMPORAL as Temporal Workflow

    USER->>COORD: Submit complex goal
    COORD->>TEMPORAL: Start durable workflow
    TEMPORAL->>KAFKA: Dispatch to Planner queue
    KAFKA->>PLANNER: Consume goal
    PLANNER->>PLANNER: Decompose into sub-tasks and select tools
    PLANNER-->>KAFKA: Publish execution plan
    KAFKA-->>TEMPORAL: Consume plan

    par Parallel execution
        TEMPORAL->>KAFKA: Dispatch sub-task A to Executor A queue
        KAFKA->>EX1: Consume sub-task A
        EX1->>EX1: Retrieve context, LLM inference, tool calls
        EX1-->>KAFKA: Publish result A
    and
        TEMPORAL->>KAFKA: Dispatch sub-task B to Executor B queue
        KAFKA->>EX2: Consume sub-task B
        EX2->>EX2: Retrieve context, LLM inference, tool calls
        EX2-->>KAFKA: Publish result B
    end

    KAFKA-->>TEMPORAL: Both results received
    TEMPORAL->>KAFKA: Dispatch to Critic queue
    KAFKA->>CRITIC: Consume results A and B
    CRITIC->>CRITIC: Verify claims, check consistency, score confidence
    CRITIC-->>KAFKA: Verification report pass or issues found
    KAFKA-->>TEMPORAL: Consume verification

    alt Issues found
        TEMPORAL->>KAFKA: Re-dispatch failing sub-task with critic feedback
    end

    TEMPORAL->>KAFKA: Dispatch to Synthesizer queue
    KAFKA->>SYNTH: Consume verified results
    SYNTH->>SYNTH: Combine, resolve conflicts, generate final output
    SYNTH-->>USER: Final response plus reasoning trace plus audit trail ID
        
Figure 3: Multi-agent collaboration — Planner decomposes the goal; Executors run in parallel; Critic verifies; Synthesizer combines

Memory Retrieval Interaction

sequenceDiagram
    participant AGENT as Agent (any archetype)
    participant REDIS as Redis (Hot Cache)
    participant PG as PostgreSQL + pgvector
    participant NEO4J as Neo4j (Knowledge Graph)
    participant ASSEMBLER as Context Assembler

    AGENT->>REDIS: Check embedding cache (query hash)
    alt Cache hit (TTL valid)
        REDIS-->>AGENT: Cached chunks
    else Cache miss
        AGENT->>PG: Dense embedding search (top-10 chunks)
        AGENT->>PG: BM25 keyword search (top-10 chunks)
        AGENT->>PG: RRF fusion to top-5 ranked chunks
        PG-->>AGENT: Retrieved chunks + source identifiers
        AGENT->>NEO4J: Graph query (entity relationships, 2-hop)
        NEO4J-->>AGENT: Related entity context
        AGENT->>PG: Query expansion variants (2-3 paraphrases)
        PG-->>AGENT: Additional chunks (merged and de-duplicated)
        AGENT->>REDIS: Cache merged result (configurable TTL)
    end
    AGENT->>PG: Fetch recent episodic decisions (last N entries)
    PG-->>AGENT: Decision history (summarized if needed)
    AGENT->>ASSEMBLER: Combine rules, retrieved chunks, graph context, history, request
    ASSEMBLER->>ASSEMBLER: Enforce slot budgets and apply contextual compression
    ASSEMBLER-->>AGENT: Final prompt within context window
        
Figure 4: Memory retrieval pipeline — cache check, hybrid vector + graph retrieval, compression, and prompt assembly

Context Window Budget (Long-Context Synthesizer Agent)

pie title Context Budget — Synthesizer Agent (30,000 tokens)
    "System Prompt + Rules" : 3000
    "Retrieved Context (multi-source)" : 20000
    "Conversation History" : 2000
    "Output Reservation" : 5000
        
Figure 5: Token budget for a Synthesizer agent — the majority of the window is reserved for multi-source retrieved content

Security & Audit Data Flow

flowchart LR
    subgraph Ingress["Ingress"]
        INPUT["Raw User Input / External Data"]
        PII_TOK["PII Tokenizer\nvault — agents see tokens only"]
    end

    subgraph AgentExec["Agent Execution"]
        HARNESS["Agent Harness\nauthorization · sandboxing · retry"]
        LLM_CALL["LLM API Call\ntokens, never raw PII"]
        TOOL["Tool Execution\nsandboxed · allowlisted"]
    end

    subgraph OutputProc["Output Processing"]
        PII_SCAN["PII Scanner\nredact any leaked sensitive data"]
        SCHEMA_VAL["Schema + Constraint Validator"]
        AUDIT_WRITE["Append-Only Audit Log\nimmutable — deletes blocked"]
    end

    subgraph Delivery["Delivery and Observability"]
        CLIENT["Client / Downstream System"]
        LOKI_LOG["Loki — Reasoning Trace"]
        ARIZE["Arize AI — Async Export"]
    end

    INPUT --> PII_TOK --> HARNESS --> LLM_CALL --> TOOL --> HARNESS
    HARNESS --> PII_SCAN --> SCHEMA_VAL --> AUDIT_WRITE
    AUDIT_WRITE --> CLIENT
    AUDIT_WRITE --> LOKI_LOG
    AUDIT_WRITE --> ARIZE
        
Figure 6: Security-first data flow — PII is tokenized before agents see it; all outputs are validated and audited before delivery

Appendix A: Glossary

Term Definition
Agent Harness The execution layer wrapping every LLM call: context assembly, tool authorization and sandboxing, output validation, retry logic, memory persistence, and observability instrumentation. In 2026, the harness is understood as the primary determinant of agent reliability — more impactful than model selection alone.
Context Engineering The discipline of deciding what information to place in an agent's context window, how much space to allocate to each information type, and how to compress when the budget is exceeded. Recognized as a distinct and more impactful practice than prompt engineering alone.
Harness Engineering The 2026 paradigm encompassing the design of the entire agent execution system — prompt design, context management, tool integration, memory architecture, guardrails, and observability — rather than focusing on prompt wording in isolation.
RAG Retrieval-Augmented Generation. The pattern where relevant documents are retrieved from a knowledge base and injected into the LLM prompt before generation, grounding the model's output in verifiable sources.
CoT (Chain-of-Thought) A prompting technique instructing the model to reason step-by-step before producing a final answer. Improves accuracy on complex multi-step tasks and produces auditable reasoning traces.
Hallucination When an LLM generates plausible-sounding but factually incorrect or fabricated content. The primary reliability risk in production agent systems.
Grounding Constraining LLM outputs to claims verifiable against a known-good knowledge source. The primary mitigation for hallucination.
Prompt Injection An attack where adversarial content in user input or processed documents attempts to override the model's system instructions and cause it to take unintended actions.
Memory Dreaming A pattern originated by OpenClaw where a background consolidator model distills recent short-term memory into long-term vector storage during idle periods, enabling agents to maintain coherence over extended operation without context overflow.
Phase-Gating A workflow control pattern where a complex task is divided into explicit phases (plan, gather, draft, review, deliver). No phase begins until the previous phase's output has been validated. Prevents early-stage errors from propagating through the entire workflow.
RRF (Reciprocal Rank Fusion) An algorithm for combining ranked result lists from multiple retrieval systems (e.g., BM25 keyword search + dense embedding search) into a single unified ranking. Standard approach for hybrid retrieval.
pgvector A PostgreSQL extension that adds vector data types and approximate nearest-neighbor search (ANN) indices for semantic similarity queries. Enables vector retrieval without a separate dedicated vector database.
HNSW Hierarchical Navigable Small World — a graph-based ANN index algorithm offering fast approximate nearest-neighbor search with high recall. The recommended index type for pgvector production deployments.
MCP (Model Context Protocol) An open protocol for connecting LLMs to external tools, data sources, and capabilities in a standardized interface. MCP servers expose tool capabilities via a declared manifest; MCP-aware harnesses discover and invoke them without bespoke integrations. Adopted by Anthropic (Claude), OpenAI, Google ADK, Microsoft Agent Framework, and major enterprise platforms.
A2A (Agent-to-Agent Protocol) Linux Foundation open standard for inter-agent communication across frameworks and vendors. 150+ supporting organizations (Google, Microsoft, AWS, Oracle, Databricks, Snowflake, Salesforce, IBM, SAP). v1.0 released 2026. Integrated into Google ADK, Microsoft Agent Framework, LangGraph, Semantic Kernel, Amazon Bedrock AgentCore, and Azure AI Foundry. Complements MCP: MCP connects agents to tools; A2A connects agents to agents. Financial services is a confirmed production adoption vertical.
Google ADK Google's Agent Development Kit — open-source enterprise agent framework (Apache 2.0), GA 2026 in Python, TypeScript, Go, and Java. Defines workflows as typed directed graphs; native GKE deployment; Vertex AI integration; built-in eval framework; A2A first-class. The recommended framework for GKE-native agent deployments on Google Cloud.
Microsoft Agent Framework (MAF) The production GA successor (April 2026) that merges AutoGen and Semantic Kernel. Supports Python and .NET, graph-based orchestration, session management, MCP + A2A, and hosted deployment on Azure AI Foundry. AutoGen is in maintenance mode — new projects should use MAF.
LLM-as-Judge An evaluation pattern where a separate LLM evaluates the quality of another agent's output. Enables scalable evaluation beyond what human review alone can cover. Has documented biases (verbosity preference, self-preference) and cannot detect subtle factual hallucinations — must be combined with rule-based metrics and calibrated against human labels.
MLflow Open-source AI engineering platform (30M+ monthly downloads) for experiment tracking, model evaluation, and agent evaluation. Recommended platform for logging agent runs, running batch evals against labeled corpora, and comparing prompt/model versions before production promotion.
vLLM A high-throughput open-source LLM inference server using PagedAttention to maximize GPU utilization. Used for self-hosting open-source models on GKE GPU node pools.
Apache Temporal An open-source durable workflow engine. Guarantees exactly-once execution of workflow steps even through process crashes. Recommended for all production multi-step agent workflows requiring durability and audit trails.

Appendix B: Model Comparison

[ASSUMPTION] Specifications and pricing as of June 2026. LLM pricing changes frequently — verify against current provider documentation before making procurement or architecture decisions.
Model Provider Context Window Relative Cost Strengths Weaknesses Recommended Agent Role
Claude Opus 4.8 Anthropic 200K $$$$ Highest reasoning quality; exceptional instruction following; very long context; best for complex synthesis Highest cost and latency Synthesizer, Planner (complex), long-document analysis
Claude Sonnet 4.6 Anthropic 200K $$ Best accuracy/cost balance; strong chain-of-thought; reliable structured output Slower than Haiku Coordinator, Executor (standard), most agent archetypes
Claude Haiku 4.5 Anthropic 200K $ Fastest response; lowest cost; suitable for high-volume triage and classification Lower reasoning depth on complex tasks Triage, routing, PII scanning, memory consolidation, Critic (simple checks)
GPT-4o OpenAI 128K $$$ Reliable; strong multi-modal reasoning; widely vetted in production Higher cost than Sonnet; shorter context than Claude Executor (when provider diversity is required), Planner
Gemini 1.5 Pro Google Vertex AI 1M $$ Largest context window available; GKE-native (low egress cost); competitive pricing Less battle-tested for constrained structured output Long-document analysis, multi-document Synthesizer, fallback
Llama 3.1 / 3.2 70B+ Self-hosted (vLLM) 128K ~ (GPU amortized) Near-zero per-token cost; data never leaves the cluster; fine-tunable; no egress Requires GPU infrastructure ops; lower out-of-box accuracy on complex tasks High-volume batch Executors; domain-specific tasks after fine-tuning

Appendix C: References & Further Reading

  • OWASP Top 10 for Large Language Model Applications (2025 Edition) — owasp.org/www-project-top-10-for-large-language-model-applications
  • Google Agent Development Kit (ADK) — GKE Deployment Guide — cloud.google.com/kubernetes-engine/docs/tutorials/agentic-adk-vertex
  • Google Agent Development Kit Documentation — adk.dev
  • Microsoft Agent Framework 1.0 GA Announcement — devblogs.microsoft.com/agent-framework
  • Microsoft Agent Framework Overview — learn.microsoft.com/en-us/agent-framework/overview
  • A2A Protocol v1.0 — Linux Foundation AI & Data — a2a-protocol.org
  • A2A Protocol: One Year Milestones (150+ Organizations) — linuxfoundation.org, April 2026
  • LangGraph Production Patterns — langchain-ai.github.io/langgraph
  • Agent Harness Engineering: The Rise of the AI Control Plane — Adnan Masood, Medium, April 2026
  • From Prompts to Harnesses — Four Years of AI Agentic Patterns — bits-bytes-nn.github.io, April 2026
  • Agent Harness for Large Language Model Agents: A Survey — Preprints.org, April 2026
  • Lewis et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020.
  • Ji et al. (2023). "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys.
  • Yao et al. (2023). "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023.
  • Apache Temporal Workflow Engine Documentation — docs.temporal.io
  • pgvector: Open-source vector similarity search for PostgreSQL — github.com/pgvector/pgvector
  • NIST AI Risk Management Framework (AI RMF 1.0) — nist.gov/artificial-intelligence
  • Anthropic Responsible Scaling Policy — anthropic.com/rsp
  • vLLM: Easy, Fast, and Cheap LLM Serving — github.com/vllm-project/vllm

Share this: