System Architecture

Agentic Workflow System Design

A production-ready reference architecture for multi-agent applications. It features LangGraph stateful orchestration, conditional human-in-the-loop (HITL) routing, isolated tool execution, and unified safety constraints.

Clients: Web App, Mobile App
API Gateway & Auth: authentication & tenant context
Input Guardrail: safety, PII, scope & permission checks
LangGraph Supervisor: reads graph state & routes intent
Shared Graph State: session state, context & tool results

Knowledge Agent (vector RAG & docs). RAG Pipeline: Query Rewrite → Retriever → Rank Chunks → Generate Answer → Citations
Task Execution (action execution & writes). Tool Execution: Plan Action → Permission Check → MCP / API Call → Result Normalization
Account Manager (billing & seat limits). State & Entitlements: Billing & Seats, Entitlements Check, Account State
Retention Agent (churn risk mitigation). Escalation Route: Detect Churn Risk → Prepare Escalation

Output Validation & Guardrails: policy, hallucination, citation & PII checks

Route by Source Agent & Risk Level
Direct Path: if the Knowledge / Account Agent response passes safety checks → User Response (instant resolution via UI)
Human Handoff Route: if Task Agent (MCP) → High Risk, or Retention Agent → Escalation → ServiceNow / Salesforce (human queue & CRM logging) → User Response (guided handoff / CTA)

Continuous System Tracking
Observability: traces, latency & cost
Evaluation: hallucination & quality checks
Auditability: action & tool execution logs

Why This Architecture Matters in Production

Each layer reduces a real failure mode in enterprise AI agent deployments.

Trigger Layer

Handles user requests from web, Slack, API, webhook, or scheduled jobs. Examples: Next.js, Slack App, API Gateway, Event Grid, Kafka.

Security Layer

Protects access, identity, tenant data, and risky actions. Examples: OAuth, RBAC/ABAC, Azure AD, Okta, policy engine, secrets manager.

Orchestration Layer

Controls routing, state transitions, retries, agent handoffs, and human approval. Examples: LangGraph, Semantic Kernel, AutoGen, custom state machine.

Retrieval Layer

Brings trusted business context into the agent before answering. Examples: vector DB, Azure AI Search, OpenSearch, Pinecone, Databricks, Confluence, Glean.

Tools Layer

Executes approved actions through controlled tool interfaces. Examples: MCP servers, REST APIs, internal services, ServiceNow, Salesforce, Jira, Datadog.

State & Memory Layer

Stores session state, graph checkpoints, tool results, user context, and approval status. Examples: LangGraph checkpointing, Redis, Postgres, Cosmos DB.

Observability Layer

Tracks logs, traces, latency, cost, failures, and tool-call history for debugging and audits. Examples: LangSmith, Datadog, Grafana, CloudWatch, Azure Monitor.

Evaluation Layer

Measures answer quality, tool accuracy, hallucination risk, and workflow drift. Examples: LangSmith evals, Databricks MLflow, RAGAS, custom test sets, offline grading.

System Design Document

Enterprise Multi-Agent AI Assistant

Comprehensive requirements, architecture logic, and operational metrics gathered for production deployment.

1. Clarify the Problem

We are designing an enterprise AI assistant for a web and mobile experience. The assistant helps users ask questions, retrieve trusted business knowledge, understand account or subscription details, perform approved actions, and escalate complex issues to human support when needed.

The system should support both self-service answers and action-oriented workflows, while keeping security, accuracy, permission checks, and auditability in place.

Example User Requests

“Why was my billing seat limit exceeded?”

“What does my subscription include?”

“Can you explain this policy?”

“Create a support case for this issue.”

“Update my account preference.”

“Cancel this subscription, but confirm before taking action.”

“Show me the status of my recent support case.”

The Goal: Design a secure, scalable, reliable, observable, and production-ready multi-agent AI assistant using web/mobile entry points, API gateway, authentication, LangGraph-style orchestration, RAG, MCP/tool execution, guardrails, and human-in-the-loop escalation.

2. Define Functional Requirements

Functional requirements describe what the system must do.

Accept user requests from web app, mobile app, or API.

Authenticate user and identify role, tenant, account, subscription, and session.

Validate request using input guardrails.

Detect unsafe prompts, prompt injection, unsupported scope, sensitive data, and unauthorized intent.

Classify the user's intent.

Route request to correct specialist agent.

Support knowledge-based Q&A using trusted documents and RAG.

Generate grounded answers with citations when using enterprise knowledge.

Support account, billing, subscription, entitlement, and seat-limit questions.

Execute approved actions through MCP tools, internal APIs, or enterprise services.

Check permissions before every read or write action.

Require user confirmation or human approval for risky write actions.

Escalate unresolved, risky, or sensitive cases to human support or CRM.

Return a final validated response to the user.

Store session state, tool results, approval status, traces, and audit logs.

Support feedback collection for quality improvement.

Support responsive behavior on both web and mobile.

3. Define Non-Functional Requirements

Non-functional requirements describe how well the system must work across performance, security, and scale.

3.1 Performance and Latency

For a web/mobile AI assistant, we separate normal UI performance from AI workflow performance.

Web & Mobile App Performance

Area	Ideal Target	Acceptable Target	Notes
Initial page/app shell load	≤ 2.5 seconds LCP	≤ 4 seconds	Aligns with Core Web Vitals loading guidance.
Server response / TTFB	≤ 800 ms	≤ 1.8 seconds	Good backend responsiveness target.
UI tap/click feedback	≤ 100 ms	≤ 200 ms	User should feel the UI reacted immediately.
Interaction responsiveness	≤ 200 ms INP	≤ 500 ms	Good Core Web Vitals responsiveness target.
Layout stability	CLS ≤ 0.1	CLS ≤ 0.25	Avoid content jumping during load.
Cached/revisited page reload	≤ 1 second	≤ 2 seconds	Use caching, CDN, and client-side hydration.
API read request	300–800 ms	≤ 1.5 seconds	For normal account/profile/status reads.
API write request	≤ 1.5 seconds	≤ 3 seconds	For preference updates or simple case creation.

AI Assistant Response Performance

Request Type	Ideal Target	Acceptable Target	UX Behavior
Simple greeting/help prompt	≤ 1 second	≤ 2 seconds	Return immediately.
Basic account/status answer	2–4 seconds	≤ 6 seconds	Show loading state if needed.
RAG-based answer with citations	4–7 seconds	≤ 10 seconds	Stream partial response or show progress.
Tool/MCP action workflow	5–9 seconds	≤ 15 seconds	Show step status: checking permission, calling tool, validating result.
High-risk action requiring approval	Depends on approval	Not fully automated	Ask user/human for confirmation.
Human escalation	Case created in ≤ 10 seconds	≤ 20 seconds	Show ticket/case reference if available.

Note: “For UI performance, I would target the Core Web Vitals (LCP under 2.5 seconds, INP under 200 milliseconds, CLS under 0.1), plus TTFB under 800 milliseconds as a supporting server-side metric. For AI workflows, I would separate simple answers from tool-heavy workflows. A simple answer should return within 1–2 seconds, RAG answers should ideally complete in 4–7 seconds, and tool-based workflows can take 5–9 seconds with progress indicators and streaming.”

3.2 Availability & Reliability

Area	Target
Web/mobile frontend	99.9% or higher
API/backend availability	99.9% or higher
Critical account/action services	99.9% or higher
AI model fallback path	Required
Tool/API retry support	Required
Graceful degradation	Required

If the AI model fails, retry or switch to a fallback model.
If RAG retrieval fails, return a safe fallback response.
If an API call fails, retry with backoff; if it still fails, escalate.
If confidence is low, ask a clarifying question.
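The degradation rules above can be sketched as a small wrapper. It retries a call with exponential backoff and jitter, tries one fallback, and otherwise signals that the request should be escalated. This is a minimal sketch under assumptions: `call_with_retry` and `EscalationRequired` are hypothetical names, and the attempt count and delays are placeholders rather than targets from this design.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class EscalationRequired(Exception):
    """Hypothetical signal to the supervisor: retries and fallback are exhausted."""


def call_with_retry(
    primary: Callable[[], T],
    fallback: Optional[Callable[[], T]] = None,
    max_attempts: int = 3,
    base_delay: float = 0.2,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `primary` with exponential backoff, then try `fallback`, then escalate."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return primary()
        except Exception as exc:  # in production, catch only transient error types
            last_error = exc
            if attempt < max_attempts - 1:
                # Backoff doubles each attempt (capped), with full jitter to avoid thundering herds.
                delay = min(max_delay, base_delay * (2 ** attempt))
                sleep(random.uniform(0, delay))
    if fallback is not None:
        try:
            return fallback()  # e.g. a fallback model or a safe canned response
        except Exception as exc:
            last_error = exc
    raise EscalationRequired(f"all attempts failed: {last_error!r}")
```

Injecting `sleep` keeps the wrapper testable. In production I would retry only on transient errors (timeouts, throttling) and let the supervisor map `EscalationRequired` to the human handoff route.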

3.3 Scalability

Area	Requirement
Users	Support many concurrent web/mobile users.
Tenants/accounts	Isolate data and scale per tenant.
RAG queries	Scale vector/search independently.
Tool calls	Use queueing, rate limits, and retries.
Traffic spikes	Use autoscaling and CDN caching.
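For the "Tool calls" row, one common way to enforce a rate limit is a token bucket per tool (or per tenant). Calls that cannot get a token are queued or retried later instead of being sent. Below is a minimal sketch with an injectable clock. The class name and rates are illustrative assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket limiter for outbound tool/API calls (e.g. one bucket per tool or tenant)."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so short bursts are allowed
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; False means the caller should queue or back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```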

“I would scale the frontend through CDN/edge caching, scale the API layer horizontally, scale RAG independently, and isolate long-running tool workflows through queues/async workers.”

3.4 Security Requirements

Use authentication for every user.
Use RBAC/ABAC for authorization.
Enforce tenant isolation.
Run permission checks before retrieval and tool execution.
Never expose secrets to the model.
Use a tool allowlist & detect prompt injection.
Sanitize inputs/outputs and encrypt data.

Risky Actions (Require Approval):

cancellation, refund, payment change, downgrade, entitlement change, ownership change, CRM escalation with sensitive data.
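The rules above mean that authorization is a deterministic platform check, not a model decision. Below is a minimal sketch that assumes a hypothetical in-memory allowlist and hypothetical role names. A real deployment would back it with the RBAC/ABAC policy engine and identity provider from the Security Layer.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical allowlist: tool name -> roles permitted to call it.
TOOL_ALLOWLIST: Dict[str, Set[str]] = {
    "get_account_state": {"customer", "support_agent", "account_manager"},
    "create_support_case": {"customer", "support_agent"},
    "cancel_subscription": {"customer", "account_manager"},
    "issue_refund": {"account_manager"},
}

# Risky actions (see the list above) always need confirmation or human approval.
RISKY_TOOLS: Set[str] = {"cancel_subscription", "issue_refund"}


@dataclass
class Caller:
    user_id: str
    tenant_id: str
    roles: Set[str] = field(default_factory=set)


def authorize(caller: Caller, tool: str, resource_tenant_id: str,
              confirmed: bool = False) -> Decision:
    """Checked by the platform before every tool call; the model cannot override it."""
    if tool not in TOOL_ALLOWLIST:
        return Decision.DENY  # unknown tools are rejected even if the model requests them
    if caller.tenant_id != resource_tenant_id:
        return Decision.DENY  # tenant isolation
    if not (caller.roles & TOOL_ALLOWLIST[tool]):
        return Decision.DENY  # RBAC: caller holds no permitted role
    if tool in RISKY_TOOLS and not confirmed:
        return Decision.NEEDS_APPROVAL
    return Decision.ALLOW
```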

3.5 Accuracy & Grounding

Use RAG for policy, docs, and product answers.
Retrieve only documents the user can access.
Use reranking or top-chunk selection.
Provide citations and validate answers.
Detect hallucination risk and ask clarifying questions.
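The rule "retrieve only documents the user can access" is easiest to get right when the ACL filter runs before scoring. Below is a minimal sketch that assumes each chunk carries the group ACL of its source document. Brute-force cosine similarity stands in for a vector DB query plus reranker. `Chunk`, `retrieve`, and `min_score` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Set


@dataclass
class Chunk:
    doc_id: str
    text: str
    embedding: Sequence[float]
    allowed_groups: Set[str]  # ACL copied from the source document at indexing time


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec: Sequence[float], chunks: List[Chunk], user_groups: Set[str],
             k: int = 3, min_score: float = 0.2) -> List[Chunk]:
    # 1. Filter by ACL before scoring, so inaccessible text never reaches the model.
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    # 2. Score, keep the top-k, and drop chunks below a relevance floor.
    scored = sorted(((cosine(query_vec, c.embedding), c) for c in visible),
                    key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score >= min_score]
```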

Metric	Target
Citation coverage	≥ 95%
Unsupported answer rate	< 2% (ideal), < 5% (acceptable)
Tool-call success rate	≥ 95% for stable tools
Human escalation accuracy	≥ 90%
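The targets in the table can be computed offline from graded evaluation records. Below is a minimal sketch that assumes a hypothetical `EvalRecord` produced by a grader (human or LLM-as-judge). It uses the 5% upper bound from the table for unsupported answers. Human escalation accuracy is left out because it needs labeled escalation outcomes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalRecord:
    used_knowledge: bool  # answer drew on enterprise knowledge via RAG
    has_citation: bool    # at least one valid citation attached
    supported: bool       # grader judged every claim supported by sources
    tool_called: bool
    tool_succeeded: bool


def rate(numer: int, denom: int) -> float:
    # Empty denominators return 0.0; callers should require a minimum sample size.
    return numer / denom if denom else 0.0


def summarize(records: List[EvalRecord]) -> Dict[str, float]:
    rag = [r for r in records if r.used_knowledge]
    tools = [r for r in records if r.tool_called]
    return {
        "citation_coverage": rate(sum(r.has_citation for r in rag), len(rag)),
        "unsupported_answer_rate": rate(sum(not r.supported for r in records), len(records)),
        "tool_call_success_rate": rate(sum(r.tool_succeeded for r in tools), len(tools)),
    }


def missed_targets(m: Dict[str, float]) -> List[str]:
    """Names of metrics that miss the table's targets."""
    failures = []
    if m["citation_coverage"] < 0.95:
        failures.append("citation_coverage")
    if m["unsupported_answer_rate"] >= 0.05:
        failures.append("unsupported_answer_rate")
    if m["tool_call_success_rate"] < 0.95:
        failures.append("tool_call_success_rate")
    return failures
```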

3.6 Observability

Track: Request ID, Intent, Agent route, Model/Retrieval/Tool latency, Cost, Errors, Feedback.

Tools: LangSmith, Datadog, CloudWatch, MLflow
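The tracked fields map naturally onto one trace record per request. Below is a minimal, vendor-neutral sketch. `RequestTrace` and its fields are hypothetical. Exporting the record to LangSmith, Datadog, or CloudWatch would go through each tool's own SDK, which is not shown here.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class RequestTrace:
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    intent: Optional[str] = None
    agent_route: List[str] = field(default_factory=list)
    latency_ms: Dict[str, float] = field(default_factory=dict)  # per stage: model, retrieval, tool
    cost_usd: float = 0.0
    errors: List[str] = field(default_factory=list)
    feedback: Optional[int] = None  # e.g. 1 = thumbs up, -1 = thumbs down

    @contextmanager
    def span(self, stage: str) -> Iterator[None]:
        """Time one stage, record any exception, and re-raise it to the caller."""
        start = time.perf_counter()
        try:
            yield
        except Exception as exc:
            self.errors.append(f"{stage}: {exc!r}")
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.latency_ms[stage] = self.latency_ms.get(stage, 0.0) + elapsed_ms
```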

3.7 Auditability & Compliance

Requirements: Log every tool call, write action, approval decision, and user confirmation. Track citations. Support retention policies.
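Below is a minimal sketch of the audit requirement that assumes append-only JSON Lines as the storage format. The event types mirror the list above. The function and field names are hypothetical. A retention job could use the timestamp on each record to expire old entries.

```python
import json
import time
from typing import IO, Any, Dict, Iterable, Optional

AUDIT_EVENT_TYPES = {"tool_call", "write_action", "approval_decision", "user_confirmation"}


def write_audit_event(sink: IO[str], event_type: str, request_id: str, actor: str,
                      details: Dict[str, Any],
                      citations: Optional[Iterable[str]] = None) -> Dict[str, Any]:
    """Append one audit record as a single JSON line and return it."""
    if event_type not in AUDIT_EVENT_TYPES:
        raise ValueError(f"unknown audit event type: {event_type}")
    record = {
        "ts": time.time(),        # used by retention jobs to expire old records
        "event_type": event_type,
        "request_id": request_id,  # joins the audit trail to traces
        "actor": actor,            # user, agent, or human approver
        "details": details,
        "citations": list(citations or []),
    }
    sink.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```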

3.8 Maintainability

Modular architecture: Agents separated by responsibility, versioned prompts, registered tools, refreshable indexes.
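Versioned prompts and registered tools can share one small registry, which lets a prompt change be rolled back without a redeploy. Below is a minimal in-process sketch. `Registry` and `PromptVersion` are hypothetical names. A production system would persist this state in a database or a prompt-management service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    template: str


class Registry:
    """Hypothetical registry for versioned prompts and named tools."""

    def __init__(self) -> None:
        self.prompts: Dict[Tuple[str, str], PromptVersion] = {}
        self.active: Dict[str, str] = {}  # prompt name -> active version
        self.tools: Dict[str, Callable[..., object]] = {}

    def register_prompt(self, p: PromptVersion) -> None:
        self.prompts[(p.name, p.version)] = p
        self.active.setdefault(p.name, p.version)  # first version becomes active

    def activate(self, name: str, version: str) -> None:
        """Switch the active version; activating an older one is a rollback."""
        if (name, version) not in self.prompts:
            raise KeyError(f"{name}@{version} is not registered")
        self.active[name] = version

    def prompt(self, name: str) -> PromptVersion:
        return self.prompts[(name, self.active[name])]

    def register_tool(self, name: str, fn: Callable[..., object]) -> None:
        if name in self.tools:
            raise ValueError(f"tool already registered: {name}")
        self.tools[name] = fn
```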

3.9 Cost Efficiency

Use caching, route simple intents to smaller models, stream responses, track cost per workflow, set budget alerts.
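The following sketch combines two cost controls: routing simple intents to a smaller model and tracking cost per workflow with a budget alert. The model tiers, prices, intent names, and 80% alert threshold are placeholder assumptions, not real model pricing.

```python
from dataclasses import dataclass, field
from typing import Dict

# Placeholder tiers and prices (USD per 1K tokens); substitute real models and rates.
MODEL_TIERS: Dict[str, float] = {"small": 0.0005, "large": 0.01}
SIMPLE_INTENTS = {"greeting", "help", "status_lookup"}


def pick_model(intent: str) -> str:
    """Send simple intents to the cheaper tier; everything else gets the larger model."""
    return "small" if intent in SIMPLE_INTENTS else "large"


@dataclass
class CostTracker:
    monthly_budget_usd: float
    alert_ratio: float = 0.8  # alert once 80% of the budget is spent
    per_workflow_usd: Dict[str, float] = field(default_factory=dict)

    def record(self, workflow: str, model: str, tokens: int) -> bool:
        """Add one call's cost to its workflow; return True when the alert threshold is reached."""
        cost = tokens / 1000 * MODEL_TIERS[model]
        self.per_workflow_usd[workflow] = self.per_workflow_usd.get(workflow, 0.0) + cost
        return sum(self.per_workflow_usd.values()) >= self.alert_ratio * self.monthly_budget_usd
```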

4. Users & Entry Points

Users

Customers & Employees
Support agents
Account managers
Admin / Ops teams

Entry Points

Web application
Mobile application
API Gateway
Webhook / Event Job

5. High-Level Architecture

See the architecture diagram at the top of this document.

The main idea is that the system first validates the user and request, then uses a stateful orchestrator to route work to the right specialist agent. Knowledge requests go through RAG. Action requests go through MCP tools or internal APIs. Risky workflows require permission checks and human approval. The final response is validated before being returned, and all steps are logged.
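The final routing step in the diagram, Direct Path vs. Human Handoff Route, reduces to a small decision function. Below is a minimal sketch in which the agent names and `Risk` levels are hypothetical. Sending answers that fail output validation to a human is my assumption, because the diagram shows only the passing case.

```python
from enum import Enum


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def route_response(source_agent: str, risk: Risk, passed_validation: bool,
                   escalate: bool = False) -> str:
    """Return "direct" (answer in the UI) or "human_handoff" (ServiceNow / Salesforce queue)."""
    if not passed_validation:
        return "human_handoff"  # assumption: never send a failed answer straight to the user
    if source_agent == "task_execution" and risk is Risk.HIGH:
        return "human_handoff"  # high-risk MCP actions need a human
    if source_agent == "retention" and escalate:
        return "human_handoff"  # churn-risk escalation
    return "direct"             # knowledge, account, and low-risk task results
```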

6. End-to-End Request Flow

Example Request Trigger

“Why was my billing seat limit exceeded, and can you create a support case?”

1. User sends a message through the web or mobile app.

2. Request reaches the API Gateway.

3. Auth layer validates user identity, role, tenant, and account context.

4. Input Guardrail checks prompt injection, PII risk, unsafe content, and permission boundaries.

5. LangGraph StateGraph Supervisor receives the request.

6. Supervisor reads shared graph state, previous conversation context, and session metadata.

7. Supervisor classifies the intent as both an account question and a support action.

8. Account Manager Agent checks billing, subscription, seat usage, entitlement, and account state.

9. Knowledge Agent retrieves billing rules or product policy using RAG if needed.

10. Task Execution Agent prepares a plan to create a support case.

11. System performs permission checks before case creation.

12. If allowed and low risk, the MCP Client calls the approved CRM or case-management tool.

13. If risky, the system asks for confirmation or routes to human approval.

14. Tool result is normalized and returned to the supervisor.

15. Output Validation layer checks final answer for citations, hallucination risk, policy compliance, PII, and action confirmation.

16. User receives a final answer with the reason for the seat-limit issue and the support case status.

17. Logs, traces, tool calls, state updates, and evaluation data are stored.
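The 17 steps can be sketched as a plain-Python pipeline over a shared state dict: guardrail, supervisor, one or more specialist agents, then output validation. This is not LangGraph's API; LangGraph would express the same nodes and conditional edges with its own `StateGraph`. Keyword matching stands in for the LLM intent classifier, and every agent result and case ID below is hypothetical placeholder data.

```python
from typing import Callable, Dict, List, TypedDict


class GraphState(TypedDict):
    tenant_id: str
    message: str
    intents: List[str]
    results: Dict[str, str]
    log: List[str]
    response: str


Node = Callable[[GraphState], GraphState]


def input_guardrail(state: GraphState) -> GraphState:
    # Toy check only; a real guardrail would use policy engines and classifiers.
    if "ignore previous instructions" in state["message"].lower():
        raise PermissionError("possible prompt injection")
    state["log"].append("guardrail")
    return state


def supervisor(state: GraphState) -> GraphState:
    # Keyword routing stands in for an LLM classifier; one message may need several agents.
    msg = state["message"].lower()
    if "seat" in msg or "billing" in msg:
        state["intents"].append("account")
    if "support case" in msg:
        state["intents"].append("task")
    if not state["intents"]:
        state["intents"].append("knowledge")
    state["log"].append("supervisor:" + ",".join(state["intents"]))
    return state


def account_agent(state: GraphState) -> GraphState:
    state["results"]["account"] = "Your plan includes 10 seats and 12 are assigned."  # placeholder
    return state


def knowledge_agent(state: GraphState) -> GraphState:
    state["results"]["knowledge"] = "Policy answer with citations (RAG placeholder)."
    return state


def task_agent(state: GraphState) -> GraphState:
    # Placeholder: a real node would run the permission check and call the CRM tool via MCP,
    # pausing for approval on risky actions.
    state["results"]["task"] = "Support case CASE-0001 was created."
    return state


AGENTS: Dict[str, Node] = {"account": account_agent, "knowledge": knowledge_agent, "task": task_agent}


def output_validation(state: GraphState) -> GraphState:
    # Placeholder for citation, hallucination, policy, and PII checks.
    state["response"] = " ".join(state["results"][i] for i in state["intents"])
    state["log"].append("validated")
    return state


def handle(tenant_id: str, message: str) -> GraphState:
    state: GraphState = {"tenant_id": tenant_id, "message": message, "intents": [],
                         "results": {}, "log": [], "response": ""}
    for node in (input_guardrail, supervisor):
        state = node(state)
    for intent in state["intents"]:
        state = AGENTS[intent](state)
        state["log"].append("agent:" + intent)
    return output_validation(state)
```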

Requirement Summary

For this system, I would first gather functional requirements around web/mobile user input, authentication, intent routing, RAG-based answers, tool execution, permission checks, human approval, escalation, and final response validation.

For non-functional requirements, I would define clear targets: initial page load under 2.5s, TTFB under 800ms, interaction response under 200ms, simple AI answers in 1–2s, RAG answers in 4–7s, and tool workflows in 5–9s with progress indicators. I would also define availability, scalability, security, accuracy, observability, auditability, maintainability, and cost-efficiency requirements.