Agent Development and Deployment: Building a Production-Grade Intelligent System from Scratch

1. Introduction: Easy to Develop Agents, Hard to Deploy Them

This article will focus on best practices for Agent development and deployment, systematically breaking down the complete chain of a production-grade Agent, from underlying principles to practical implementation. You will learn about:

Agent Core Architecture: Runtime environment design and session isolation mechanisms
Agent Security OWASP Threat Model: How to defend against attacks like memory poisoning and tool abuse
AgentOps CI/CD Pipeline Setup: Integrating prompt injection detection, effectiveness evaluation, and dependency scanning into the pipeline
Containerized Deployment and best practices for the principle of least privilege
Observability: Logging, tracing, and replay mechanisms

Whether you are a small-team developer just getting started with Agents or an engineer at a large company building an AI platform, this article will provide you with a directly executable operations manual. We will start from code and step by step build an Agent system with complete operational capabilities.

2. Agent Core Architecture and Runtime Environment Design

2.1 The “Four Layers” of an Agent: LLM + Tools + Memory + Execution Loop

To understand the production deployment of an Agent, you first need to understand what happens inside its runtime. A typical Agent can be broken down into four core components:

LLM Core: The large language model responsible for understanding user intent, generating reasoning steps, and deciding which tools to call. The LLM itself is non-deterministic—the same question may produce different paths in two calls, which is the first challenge for operational monitoring.
Tool Set: The external capabilities the Agent can invoke, such as web search, calculator, file parsing, and database queries. When calling tools, the Agent must generate parameter structures that conform to the tool’s interface, guided by the LLM’s judgment.
Memory System: Includes short-term memory (current session context) and long-term memory (cross-session knowledge bases or user profiles). Memory is key to making an Agent “smart,” but it is also the most error-prone part—memory poisoning attacks can cause the Agent to consistently output incorrect or harmful information.
Execution Loop: The core logic of the Agent: receive input → LLM reasoning → decide action → call tool → process result → continue reasoning.

This loop may continue for multiple turns until a final output is produced.

2.2 Runtime Environment Isolation: Why Can’t You Run All Agents in One Python Process?

This is the most common mistake beginners make. Imagine: you start a FastAPI service, User A asks “Help me analyze this PDF,” User B asks “Search for the latest advances in quantum computing”—you create separate Python coroutines in the same service to handle each request. Looks fine, right?

But in a production environment, this design has fatal flaws:

Session Cross-Contamination: User A’s Agent calls a save_to_database tool and writes incorrect data. Because the session variables are not fully isolated, User B’s Agent may misread or overwrite that data.
Resource Preemption: User A’s Agent runs a large model inference locally (consuming 12GB GPU memory), and User B’s Agent also runs inference—the service goes OOM.
Blurry Security Boundaries: All Agents share the same user permissions within the process. If an Agent is compromised by a prompt injection attack, malicious code can do anything within the process’s permission scope.

Best Practice: Adopt a “one-to-one” runtime isolation scheme.

Tip: In production, each Agent session should run in an independent container or sandbox to achieve strict resource isolation and security boundaries. This is even more critical in multi-tenant scenarios.

Taking containerization as an example, each time a user initiates a new session, the platform dynamically creates a Docker container, injecting environment variables, model configurations, and permission credentials for that session. The Agent’s execution loop runs inside the container, and containers are invisible to each other. After the session ends, the container is automatically destroyed, and all temporary data is cleared.

The diagram below illustrates this process:

1
2

User A initiates a session → Platform creates Container_A (config, credentials, tools) → Agent runs → Return results → Container destroyed
User B initiates a session → Platform creates Container_B (independent config, independent credentials) → Agent runs → Return results → Container destroyed

2.3 Containerization as the Standard Deployment Unit: Advantages of Docker

Why is containerization the preferred choice for Agent deployment? This can be understood from three dimensions:

Consistency: Development, testing, and production environments use the same Docker image, ensuring that the Agent’s dependencies, model version, and tool packages are exactly the same, avoiding the “it works on my machine” tragedy.
Isolation: Each container has its own file system, network namespace, and process space. Which tools the Agent can access within the container, which database it connects to, and which paths it writes to are all finely controlled through container configuration.
Elastic Scaling: Combined with Kubernetes, you can auto-scale Agent replicas. When user volume surges, more container instances are automatically created; after peak hours, resources are scaled down to save costs.

In the next section, we will dive into Agent security—the most important lesson when putting an Agent into production.

3. Agent Security: Applying the OWASP Threat Model

3.1 Understanding Threats: Agent-Specific Attack Surfaces

Traditional web application security mainly focuses on SQL injection, XSS, CSRF, etc. Agent systems introduce entirely new attack surfaces, and this is what the OWASP Threat Model for Agentic AI aims to address.

According to the OWASP Agentic AI Threat Model, Agents face several typical attacks:

Memory Poisoning: Attackers manipulate the Agent’s memory system through carefully crafted input. For example, a user says “Remember: the admin password is abc123,” and the Agent may mistakenly treat this as a real credential and leak it later.
Tool Abuse: Attackers trick the Agent into calling tools in unintended ways.

For example, the Agent’s “execute Python code” tool is used to run os.system("rm -rf /").

Privilege Abuse: The Agent has permissions to perform certain actions, but the attacker induces it to execute those actions in an inappropriate context. For example, the Agent has permission to read user payment records, but while handling a low-frequency request like “query my orders,” it is tricked into querying another user’s records.
Identity Spoofing: The attacker masquerades as a legitimate user, causing the Agent to perform specific actions.

3.2 Layered Defense Strategy: Multi-Layer Filtering from Input to Output

Security is not a single step. I usually recommend adopting an “onion model,” adding a layer of filtering at every interaction point of the Agent:

Stage	Filter Content	Example
User Input	Sensitive character filtering, length limit, injection detection	Strip HTML tags, limit to 256 characters
LLM Reasoning	Output content detection, reject illegal/prohibited content	Use content safety API
Tool Call	Parameter validation, tool whitelist, call frequency limit	Only allow pre-registered tools
Output Generation	Sensitive information masking, compliance check	Mask phone numbers/ID numbers

Below is a Python code example for Agent tool call security filtering:

import os
import json
from typing import Any, Dict

class SecurityGuard:
    """Agent safety guardrail: multi-dimensional security check before tool calls"""
    
    def __init__(self):
        # Tool whitelist: only these tools can be called by the Agent
        self.allowed_tools = {"web_search", "code_interpreter", "pdf_reader"}
        
        # Sensitive parameter patterns: prohibit the Agent from passing these parameters when calling tools
        self.sensitive_params = {"password", "api_key", "token", "secret"}
        
    def validate_input(self, user_input: str) -> str:
        """
        Input filtering: remove dangerous characters, limit length
        
        Args:
            user_input: raw user input
            
        Returns:
            clean safe input
        """
        # 1. Length limit: prevent long text injection
        max_len = 512
        cleaned = user_input[:max_len]
        
        # 2. Remove unnecessary characters: only keep letters, numbers, spaces, and common punctuation
        allowed_chars = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,!?-:;'\"")
        cleaned = ''.join(c if c in allowed_chars else ' ' for c in cleaned)
        
        # 3. Simple injection detection: detect common attack patterns
        injection_patterns = [
            "rm -rf",  "drop table", "system(", "exec(", "eval(",
            "<script", "onload=", "onerror="
        ]
        for pattern in injection_patterns:
            if pattern.lower() in cleaned.lower():
                raise ValueError(f"Detected potential injection behavior: {pattern}")
        
        return cleaned
        
    def validate_tool_call(self, tool_name: str, params: Dict[str, Any]) -> bool:
        """
        Tool call security check: whitelist + sensitive parameter filtering
        
        Args:
            tool_name: name of the tool the Agent attempts to call
            params: call parameters
            
        Returns:
            True: passed check; False: rejected
        """
        # 1. Tool whitelist check
        if tool_name not in self.allowed_tools:
            print(f"[SECURITY] Disallowed tool {tool_name} was called, rejected")
            return False
            
        # 2. Sensitive parameter check: prevent Agent from leaking keys in parameters
        for key in params.keys():
            if key.lower() in self.sensitive_params:
                print(f"[SECURITY] Sensitive parameter {key} appears in tool call, rejected")
                return False
        
        # 3. For code execution tools, add extra security restrictions
        if tool_name == "code_interpreter":
            dangerous_ops = ["import os", "import subprocess", "open('/')", "shutil.rmtree"]
            code_str = json.dumps(params)  # convert parameters to text for inspection
            for op in dangerous_ops:
                if op in code_str:
                    print(f"[SECURITY] Detected dangerous operation: {op}")
                    return False
        
        return True

# Usage example
guard = SecurityGuard()

try:
    # Simulate user input
    raw_input = "帮我查询订单<script>alert('xss')</script>"  # Contains XSS attack
    safe_input = guard.validate_input(raw_input)
    print(f"✅ Input cleaning passed: {safe_input}")
except ValueError as e:
    print(f"❌ Input rejected: {e}")

# Simulate tool call
tool_call_params = {
    "tool": "code_interpreter",
    "params": {"code": "print(os.listdir('/'))"}  # Attempt to read root directory
}

if guard.validate_tool_call(tool_call_params["tool"], tool_call_params["params"]):
    print("✅ Tool call passed security check")
else:
    print("❌ Tool call rejected by security policy")

3.3 Least Privilege and Short-Lived Credentials: Implementing IAM Roles

There is a golden rule in security: An entity should only have the minimum permissions necessary to complete its task. In the context of Agents, this means:

Do not give an Agent a universal API Key. Instead, assign a temporary credential to each Agent or even each session, valid only for the duration of the session.
Network boundary control: The Agent container should only be able to access the services it needs (e.g., LLM API, internal knowledge base). Outbound access to the internet should be controlled through a proxy.

Best Practice: In a production cluster, integrate Kubernetes ServiceAccounts with IAM roles to configure least-privilege policy templates for each Agent. For example, an Agent that only does customer service Q&A might have the following permission template:

Can read the customer service knowledge base (read-only)
Can call the LLM API (whitelisted URL)
Cannot access the user database
Cannot call file deletion tools

4. AgentOps in Practice: Building a CI/CD Pipeline

4.1 Unique Operational Challenges of Agent Applications

The DevOps pipeline for traditional web applications focuses on: code compiles → unit tests pass → deploy to server → health check passes. For Agent applications, this pipeline is far from sufficient.

There are three special characteristics of Agent applications:

Non-deterministic Output: The same input may produce different output in two calls. This means that “testing passed” cannot only verify code logic but must also verify the Agent’s behavioral consistency across multiple scenarios.
Model Drift: After an LLM update, the Agent’s logic for calling tools may change. Continuous regression testing is required.
Security and Compliance Risks: An Agent may generate illegal, prohibited, or sensitive content. Content security scanning must be performed before release.

This is the core essence of AgentOps CI/CD Pipeline Setup—adding Agent-specific checks on top of the traditional CI/CD workflow.

4.2 A Complete Agent Pipeline: From Code to Production

Below is a complete pipeline example based on GitHub Actions, covering the entire process of build, test, security scan, and deployment:

# .github/workflows/agent-cd.yml
name: Agent CI/CD Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    
    steps:
    - name: 🛎️ Checkout code
      uses: actions/checkout@v3
    
    - name: 🐍 Set up Python environment
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'
    
    - name: 📦 Install dependencies
      run: |
        pip install -r requirements.txt
        pip install pytest bandit   # Install testing and security scanning tools
    
    - name: 🧪 Run unit tests
      run: |
        pytest tests/ --junitxml=test-results.xml
      continue-on-error: false
    
    - name: 🛡️ Security scan: dependency vulnerabilities + code security
      run: |
        bandit -r src/ -f json -o bandit-results.json  # Python code security scan
        pip freeze | safety check --json                # Check dependency vulnerabilities
    
    - name: 🔍 Prompt injection detection (Agent-specific)
      run: |
        python scripts/check_prompt_injection.py \
          --prompts-dir prompts/ \
          --threshold 0.8   # If injection risk exceeds 0.8, task fails
    
    - name: 📊 Agent effectiveness evaluation
      run: |
        python scripts/evaluate_agent.py \
          --agent-path src/agent.py \
          --test-cases data/eval_cases.json \
          --threshold 0.85   # Effectiveness evaluation must exceed 85% to pass
    
    - name: 🐳 Build Docker image
      run: |
        docker build -t agent-platform/agent:${{ github.sha }} .
        docker tag agent-platform/agent:${{ github.sha }} \
                 agent-platform/agent:latest
    
    - name: 🚀 Deploy to staging environment
      if: github.ref == 'refs/heads/main'
      run: |
        # Deploy to Kubernetes using temporary credentials
        kubectl set image deployment/agent \
          agent=agent-platform/agent:${{ github.sha }} \
          --namespace=staging

Detailed Explanation of Key Steps:

Prompt injection detection: This is an Agent-specific step. We need to traverse all prompt templates and check if there are exploitable injection points. For example, if a prompt contains User input: {user_input}, we must ensure the LLM does not interpret user input as instructions.
Agent effectiveness evaluation: It’s not enough to test code coverage; you also need to test the Agent’s performance in real-world scenarios.

I recommend preparing 50-100 test cases, including typical questions, edge cases, and malicious inputs, and evaluating the Agent’s accuracy, safety, and response time.

Security scan: In addition to traditional dependency vulnerability scanning, pay special attention to whether model dictionaries or prompt files in pypi or npm packages contain sensitive information.

4.3 Deployment Automation: Automatic Environment Creation and Consistency Guarantee

In the final step of the AgentOps pipeline, we also need to ensure consistency of the deployment environment. Using Infrastructure as Code (IaC) is the standard practice.

For example, using Terraform to define Kubernetes namespaces, service accounts, and network policies:

# agents-namespace.tf
resource "kubernetes_namespace" "agent_staging" {
  metadata {
    name = "agent-staging"
  }
}

# Create a service account with least privilege for the Agent
resource "kubernetes_service_account" "agent_sa" {
  metadata {
    name      = "agent-sa"
    namespace = "agent-staging"
  }
}

# Network policy: only allow outbound traffic to LLM API and whitelisted services
resource "kubernetes_network_policy" "agent_egress" {
  metadata {
    name      = "agent-egress-policy"
    namespace = "agent-staging"
  }
  spec {
    pod_selector {}
    egress {
      to {
        ip_block {
          cidr = [var.llm_api_cidr]  # LLM API's CIDR
        }
      }
      ports {
        port     = 443
        protocol = "TCP"
      }
    }
    policy_types = ["Egress"]
  }
}

Tip: Using IaC ensures that the configurations of development, testing, and production environments are exactly the same, avoiding deployment failures caused by “environment drift.”

5. Containerized Deployment and Least Privilege Management

5.1 Packing Agents into “Containers”

Earlier we discussed containerization as the standard deployment unit for Agent runtime. Now, let’s look at the specific Dockerfile and how to configure it for security.

# Dockerfile - Agent container image
FROM python:3.11-slim AS base

# Set working directory
WORKDIR /app

# Copy dependency file
COPY requirements.txt .

# Install dependencies (install to user directory using non-root user)
RUN pip install --user -r requirements.txt

# Create non-root user
RUN useradd -m -s /bin/bash agentuser

# Set environment variables (disable root user, limit network)
ENV AGENT_HOME=/home/agentuser
ENV AGENT_TEMP=/home/agentuser/tmp

# Create temporary directory and set permissions
RUN mkdir -p /home/agentuser/tmp && \
    chown -R agentuser:agentuser /home/agentuser

# Copy application code (using non-root user)
COPY --chown=agentuser:agentuser src/ /app/src/
COPY --chown=agentuser:agentuser config/ /app/config/

# Switch to non-root user for running
USER agentuser

# Expose port (Agent's API port)
EXPOSE 8080

# Start Agent service
CMD ["python", "src/main.py"]

Key Security Configuration Explanation:

Non-root user: USER agentuser ensures that processes inside the container run as a non-root user. Even if the Agent is compromised, the attacker cannot install software or modify system files.
Minimal base image: Using slim or alpine base images reduces the attack surface.
File permission management: All code and configuration files are owned by the non-root user, eliminating opportunities for privilege escalation.

5.2 Implementing Least Privilege in Kubernetes

In Kubernetes, implementing the Least Privilege Strategy for Agent Permissions involves multiple dimensions:

ServiceAccount: Each Agent or Agent group uses a dedicated ServiceAccount, rather than the default default account. This account can only access specific Secrets, ConfigMaps, and API resources.
RBAC Role Binding: Define a role that grants minimal permissions. For example, a read-only knowledge Agent role:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: agent-read-only-role
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "watch", "list"]   # Read-only ConfigMap
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]                    # Read-only specific Secret
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: agent-role-binding
subjects:
- kind: ServiceAccount
  name: agent-sa
  namespace: production
roleRef:
  kind: Role
  name: agent-read-only-role
  apiGroup: rbac.authorization.k8s.io

Short-lived credentials: Agents should not use hard-coded API Keys when calling external APIs. In Kubernetes, you can use the Secrets Store CSI Driver to automatically retrieve temporary credentials from Vault or AWS Secrets Manager and inject them as environment variables when the Pod starts.

5.3 Serverless Deployment Options Comparison

Besides building your own Kubernetes cluster, there are several Serverless options suitable for small teams or PoC projects:

Option	Suitable Scenario	Advantages	Disadvantages
AWS Lambda + API Gateway	Lightweight Agents, single request < 15 minutes	Zero ops, auto-scaling, pay per call	Execution time limit, cold start latency
Amazon ECS Fargate	Medium-scale Agents needing custom runtime	Serverless containers, auto-scaling	Slower cold start than Lambda
Amazon Bedrock AgentCore	Agent applications deeply integrated with LLM	Built-in MCP, security, memory management	Platform lock-in risk

For Best Practices in Agent Development and Deployment, my suggestion is: use Serverless managed services during the PoC phase to quickly validate business feasibility; if the scale is low (less than 10k requests per day) in production, Serverless may still suffice; once the business scales, transition to a custom container platform for greater customizability and compliance governance.

6. Observability: Logging, Tracing, and Replay Mechanisms

6.1 Why is Observability for Agents Harder than for Traditional Applications?

Logs in traditional web applications are linear: request enters → logic processing → response returns. You only need to record the duration and errors at each stage. But Agent interactions are non-linear: multiple rounds of reasoning, multiple tool calls, and constantly updating memory state.

For a single user request, the underlying process may include:

3-5 LLM inferences (each generating different thought paths)
2-8 tool calls
Integrating memory information for context alignment

Without a solid Agent Observability Logging and Tracing Implementation, when an Agent gives an “off-topic” answer or makes a “wrong tool call,” you have no idea which step went wrong.

6.2 Automated Instrumentation Based on OpenTelemetry

Below is a logging and tracing framework based on OpenTelemetry, which automatically inserts instrumentation points at every key step of the Agent:

import json
import uuid
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Initialize Tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure exporter (send to Jaeger or Datadog)
otlp_exporter = OTLPSpanExporter(
    endpoint="http://otel-collector:4318/v1/traces"
)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Instrument requests library for automatic tracing
RequestsInstrumentor().instrument()

class ObservableAgent:
    """Observability-enabled Agent decorator"""
    
    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.session_id = str(uuid.uuid4())  # Unique ID per session
        
    def process_request(self, user_input: str):
        """Process user request, automatically generate traces"""
        
        # Create root span representing a complete Agent interaction
        with tracer.start_as_current_span("agent_request") as root_span:
            root_span.set_attribute("agent.id", self.agent_id)
            root_span.set_attribute("session.id", self.session_id)
            root_span.set_attribute("user_input.length", len(user_input))
            
            # Stage 1: LLM inference
            with tracer.start_as_current_span("llm_inference") as llm_span:
                llm_response = self._call_llm(user_input)
                llm_span.set_attribute("llm.model", "deepseek-chat")
                llm_span.set_attribute("llm.tokens", len(json.dumps(llm_response)))
                # Record the tool the LLM decided to call
                llm_span.set_attribute("tool.planned", llm_response.get("tool_name", "none"))
            
            # Stage 2: Tool execution
            if llm_response.get("tool_name"):
                with tracer.start_as_current_span("tool_execution") as tool_span:
                    tool_span.set_attribute("tool.name", llm_response["tool_name"])
                    tool_span.set_attribute("tool.params", json.dumps(llm_response["tool_params"]))
                    tool_result = self._execute_tool(
                        llm_response["tool_name"],
                        llm_response["tool_params"]
                    )
                    tool_span.set_attribute("tool.success", tool_result["success"])
                    tool_span.set_attribute("tool.duration_ms", tool_result["duration_ms"])
                    if not tool_result["success"]:
                        tool_span.set_attribute("tool.error", tool_result["error"])
            
            # Stage 3: Output generation
            with tracer.start_as_current_span("output_generation") as output_span:
                final_output = self._generate_final_output(llm_response, tool_result)
                output_span.set_attribute("output.length", len(final_output))
            
            # Record key observability metrics
            root_span.set_attribute("total.duration_ms", 0)  # In real project, calculate latency
            
        return final_output

# Initialize tracing automatically when container starts
if __name__ == "__main__":
    agent = ObservableAgent(agent_id="customer-support-v1")
    
    while True:
        user_input = input("Please enter your question: ")
        if user_input.lower() in ["exit", "quit"]:
            break
        response = agent.process_request(user_input)
        print(f"Agent: {response}")

Key Design Elements:

Unique Session ID: self.session_id = str(uuid.uuid4()) ensures that every interaction can be fully replayed on the tracing platform.
Layered Spans: The Agent interaction is split into three spans: “LLM inference → tool call → output generation.” When a problem occurs (e.g., wrong answer), you can quickly pinpoint whether the LLM reasoning was wrong, the tool returned incorrect data, or the final generation logic had a bug.
Key Attribute Recording: Record model name, token count, tool parameters, latency, etc., for future performance analysis and capacity planning.

6.3 Session Replay: The “Silver Bullet” for Debugging Agents

Debugging an Agent is extremely difficult because the same input may produce different outputs. A powerful debugging method is session replay—saving the complete trace data (user input, LLM output, tool calls, final output) for every interaction and then replaying it in a development environment.

I recommend storing the trace data for each session in Elasticsearch or S3 and providing an API to support replay:

# Replay a specific session
curl -X POST http://agent-platform/debug/replay \
  -H "Content-Type: application/json" \
  -d '{"session_id": "550e8400-e29b-41d4-a716-446655440000"}'

# Returns: The Agent re-executes all steps of that session in replay mode and outputs detailed logs

This mechanism is extremely effective for troubleshooting issues like “Why did the Agent answer correctly last time but incorrectly this time?”

7. Advanced Tips and Pitfalls

7.1 Model Hallucination Leading to Wrong Tool Calls

Problem: When the LLM is uncertain, it may “invent” tool call parameters. For example, if a user asks “Help me look up Zhang San’s orders,” the Agent might incorrectly call the database query tool with parameters {"user": "张三", "table": "users"} instead of the actual SQL statement.

Solution:

Tool Schema Validation: Define a strict tool call format. The tool call parameters output by the LLM must pass JSON Schema validation before execution.
Pre-call Parameter Validation: Before calling the tool, perform sanity checks on the parameters. For example, if an age parameter is -5, directly reject the call.

def validate_tool_params(schema: dict, params: dict) -> bool:
    """Simple parameter validation: check type and range"""
    for key, rules in schema.items():
        if key not in params:
            return False  # Missing required parameter
        if 'type' in rules and type(params[key]).__name__ != rules['type']:
            return False  # Type mismatch
        if 'min' in rules and params[key] < rules['min']:
            return False  # Invalid range
    return True

7.2 Uncontrolled Inference Costs

Problem: The Agent’s reasoning loop may continue for many turns (especially for complex tasks), causing a surge in LLM calls and costs. I once saw an Agent call the LLM more than 50 times to answer “How to write a report.”

Solution:

Maximum Turn Limit: Set a hard limit. When the Agent’s think-act loop exceeds MAX_TURNS=10, force termination and return the existing results.
Streaming Output + Rate Limiting: Use streaming APIs to progressively return intermediate results, and add a total request limit for the same session.

MAX_TURNS = 10
LITE_LLM_COST = 0.002  # Budget cost per lightweight call

def run_with_budget(agent, user_input: str, budget: float = 0.01):
    """Agent execution with a budget: stop when exceeding budget or turn limit"""
    total_cost = 0.0
    turn_count = 0
    
    while turn_count < MAX_TURNS and total_cost < budget:
        turn_count += 1
        response = agent.run_step(user_input)
        
        # Estimate cost for this call (based on token count)
        total_cost += response['cost']
        
        if response['finished']:
            break
    
    if turn_count >= MAX_TURNS:
        print("⚠️ Reached maximum turn limit, returning partial results")
    elif total_cost >= budget:
        print("⚠️ Exceeded budget limit, returning partial results")
    
    return response['output']

7.3 Session State Leak and Credential Expiry

Problem: In a multi-tenant Kubernetes environment, if a container’s temporary data (including in-memory session state) is not thoroughly cleaned after destruction, a subsequent session might still be able to read it.

Solution:

Cleanup on Destruction: In the Pod’s lifecycle.preStop hook, ensure all temporary files and environment variables are cleaned.
Automatic Credential Rotation: Use Vault or another secret management service to set short credential validity periods (e.g., 15 minutes). When the Agent calls external APIs, use temporary credentials that automatically expire after the session ends.

Note: If your Agent uses a long-lived API Key (e.g., hardcoded in the code), this is the most urgent security issue to address—once the key is leaked, an attacker may gain permanent access to your system.

8. Summary and Expansion: From Small Team PoC to Enterprise Platform

8.1 Comparison of Two Routes: Serverless vs. Self-Built Container Cluster

Dimension	Serverless Managed (e.g., Bedrock AgentCore)	Self-Built Kubernetes Cluster
Time to launch	Hours	Days/weeks
Operations cost	Nearly zero	High (requires platform engineering team)
Customizability	Limited (runtime and extension methods are platform-defined)

Summary

Through this article, we trust you have gained a deeper understanding of Agent Development and Deployment Best Practices. We recommend practicing with real projects. If you have any questions, feel free to discuss!