Building Production-Ready AI Agents: Lessons from the Trenches
Practical insights from architecting and deploying AI agents in enterprise environments, including common pitfalls and strategies that actually work.
After spending the last two years building AI agents for enterprise systems—from GitLab merge request reviewers to ServiceNow integration tools—I’ve learned that production-ready AI agents require far more than just prompting an LLM. Here’s what actually matters.
The Reality of Production AI Agents
Most discussions about AI agents focus on the exciting parts: reasoning capabilities, tool use, and autonomy. But production deployment reveals a different set of challenges. Your agent needs to handle:
- Reliability under edge cases - What happens when the API times out mid-conversation?
- Observability - How do you debug a decision made by an LLM three steps ago?
- Cost management - Token costs add up quickly at scale
- Error recovery - Graceful degradation when tools fail
The biggest mistake I see teams make is treating AI agents like traditional software. They are probabilistic systems with different failure modes, and they demand different architectural patterns and monitoring strategies.
Architecture Patterns That Work
Here’s a simplified architecture I use for most production agents:
```mermaid
graph TD
    A[User Input] --> B[Input Validation]
    B --> C[Context Builder]
    C --> D[LLM Orchestrator]
    D --> E{Tool Required?}
    E -->|Yes| F[Tool Executor]
    F --> G[Result Validator]
    G --> D
    E -->|No| H[Response Generator]
    H --> I[Output Formatter]
    I --> J[User]

    K[Observability Layer] -.-> B
    K -.-> C
    K -.-> D
    K -.-> F
    K -.-> H
```
Key components explained:
1. Input Validation
Never trust user input directly. Validate, sanitize, and structure it before sending to the LLM:
```python
from pydantic import BaseModel, Field

class UserQuery(BaseModel):
    query: str = Field(..., min_length=1, max_length=2000)
    context: dict = Field(default_factory=dict)

    def sanitize(self):
        # Remove potentially harmful patterns
        self.query = self.query.strip()
        # Add your sanitization logic
        return self
```
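For illustration, here is a quick usage sketch showing how the model rejects bad input before any tokens are spent; the query text is just an example, not part of the real reviewer code:

```python
from pydantic import ValidationError

try:
    q = UserQuery(query="  Summarize the open incidents from last week  ").sanitize()
    print(q.query)  # whitespace stripped, length already validated
except ValidationError as e:
    # Empty or oversized queries fail fast here, long before the LLM sees them
    print(e)
```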
2. Context Builder
Build relevant context intelligently. Don’t dump everything into the prompt:
```python
async def build_context(query: UserQuery, vector_db: VectorStore):
    # Retrieve relevant documents
    docs = await vector_db.similarity_search(
        query.query,
        k=5,
        threshold=0.7
    )

    # Prioritize by relevance and recency
    ranked_docs = rank_by_relevance_and_time(docs)

    # Fit within token budget
    context = fit_token_budget(ranked_docs, max_tokens=1500)
    return context
```
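The rank_by_relevance_and_time and fit_token_budget helpers are left out above. As a rough illustration of the second one, here is a minimal sketch that assumes documents expose a page_content attribute and uses tiktoken's cl100k_base encoding for counting (both of those are my assumptions, not details from the original implementation):

```python
import tiktoken

def fit_token_budget(docs, max_tokens: int = 1500) -> str:
    """Greedily pack the highest-ranked documents into the token budget."""
    enc = tiktoken.get_encoding("cl100k_base")
    selected, used = [], 0
    for doc in docs:
        cost = len(enc.encode(doc.page_content))
        if used + cost > max_tokens:
            break  # stop at the first document that would blow the budget
        selected.append(doc.page_content)
        used += cost
    return "\n\n".join(selected)
```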
3. Tool Execution with Retry Logic
Tools will fail. Build resilience from day one:
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    reraise=True,
)
async def execute_tool(tool_name: str, params: dict):
    try:
        result = await tools[tool_name].execute(**params)
        return {"success": True, "data": result}
    except Exception as e:
        # Re-raise so tenacity actually retries; the caller converts the
        # final failure into a structured {"success": False, ...} response.
        logger.error(f"Tool {tool_name} failed: {str(e)}")
        raise
```
Always implement circuit breakers for external service calls. When GitLab’s API goes down, you don’t want your agent to retry indefinitely and rack up costs.
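As a rough illustration of what that looks like, here is a minimal in-process circuit breaker sketch. The threshold and cooldown values are arbitrary placeholders, and this is not the exact implementation I run in production:

```python
import time

class CircuitBreaker:
    """Stop calling a failing service for a cooldown period after repeated errors."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        # While open, reject calls until the cooldown has elapsed
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                return False
            self.opened_at = None  # half-open: let one call probe the service
            self.failures = 0
        return True

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


gitlab_breaker = CircuitBreaker()

async def call_gitlab(fn, *args, **kwargs):
    if not gitlab_breaker.allow():
        raise RuntimeError("GitLab circuit open; skipping call")
    try:
        result = await fn(*args, **kwargs)
        gitlab_breaker.record_success()
        return result
    except Exception:
        gitlab_breaker.record_failure()
        raise
```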
Observability is Non-Negotiable
You cannot debug what you cannot see. Instrument everything:
```python
import structlog
from opentelemetry import trace

logger = structlog.get_logger()
tracer = trace.get_tracer(__name__)

async def agent_loop(query: str):
    with tracer.start_as_current_span("agent_execution") as span:
        span.set_attribute("query_length", len(query))
        logger.info("agent.started", query=query[:100])

        # Track token usage
        token_counter = TokenCounter()

        try:
            response = await llm.generate(
                prompt=build_prompt(query),
                callbacks=[token_counter]
            )
            span.set_attribute("tokens_used", token_counter.total)
            logger.info("agent.completed",
                        tokens=token_counter.total,
                        cost=calculate_cost(token_counter.total))
            return response
        except Exception as e:
            span.record_exception(e)
            logger.error("agent.failed", error=str(e))
            raise
```
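TokenCounter and calculate_cost are assumed in the snippet above. A minimal sketch of what they might look like; the per-token price is a placeholder, not a real rate:

```python
class TokenCounter:
    """Callback that accumulates token usage reported by the LLM client."""

    def __init__(self):
        self.total = 0

    def on_usage(self, prompt_tokens: int, completion_tokens: int):
        self.total += prompt_tokens + completion_tokens


def calculate_cost(total_tokens: int, usd_per_1k_tokens: float = 0.003) -> float:
    # Placeholder blended rate; substitute your provider's actual pricing
    return round(total_tokens / 1000 * usd_per_1k_tokens, 6)
```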
Cost Management Strategies
Token costs matter at scale. Here’s what works:
- Aggressive caching - Cache LLM responses for common queries (see the sketch after this list)
- Smart model selection - Use cheaper models for simple tasks
- Streaming responses - Start showing results before completion
- Prompt optimization - Every token counts; compress ruthlessly
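For the caching point, here is a minimal sketch of keyed response caching. It hashes the final prompt and keeps entries in memory; a real deployment would more likely use Redis with a TTL, so treat this purely as an illustration:

```python
import hashlib

_response_cache: dict[str, str] = {}

async def cached_generate(llm, prompt: str) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _response_cache:
        return _response_cache[key]  # cache hit: zero tokens spent
    response = await llm.generate(prompt=prompt)
    _response_cache[key] = response
    return response
```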
Example cost-aware routing:
```python
async def route_to_model(query_complexity: float, budget: float):
    if query_complexity < 0.3 and budget < 0.01:
        return "claude-haiku-4-5"   # Fast and cheap
    elif query_complexity < 0.7:
        return "claude-sonnet-4-5"  # Balanced
    else:
        return "claude-opus-4-1"    # Complex reasoning
```
Real-World Example: GitLab MR Reviewer
Here’s a simplified version of the GitLab merge request reviewer I built:
```python
class GitLabMRReviewer:
    def __init__(self, llm_client, gitlab_client):
        self.llm = llm_client
        self.gitlab = gitlab_client

    async def review_mr(self, project_id: int, mr_id: int):
        # Fetch MR details
        mr = await self.gitlab.get_merge_request(project_id, mr_id)
        diff = await self.gitlab.get_diff(project_id, mr_id)

        # Build context
        context = {
            "title": mr.title,
            "description": mr.description,
            "changes": self.parse_diff(diff),
            "project_context": await self.get_project_context(project_id)
        }

        # Generate review
        review = await self.llm.generate(
            prompt=self.build_review_prompt(context),
            max_tokens=1500
        )

        # Post as comment
        await self.gitlab.create_comment(
            project_id, mr_id,
            self.format_review(review)
        )

        return review
```
The full implementation includes error handling, rate limiting, and extensive logging—about 500 lines total.
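Of those pieces, rate limiting is the one teams most often skip. A minimal sketch of the idea, using a semaphore to cap concurrency plus a fixed delay between calls; the numbers are placeholders, not GitLab's documented quotas:

```python
import asyncio

class RateLimiter:
    """Cap concurrency and pace requests to an external API."""

    def __init__(self, max_concurrent: int = 5, min_interval: float = 0.2):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._min_interval = min_interval

    async def run(self, coro_fn, *args, **kwargs):
        async with self._semaphore:
            result = await coro_fn(*args, **kwargs)
            await asyncio.sleep(self._min_interval)  # simple pacing between calls
            return result


gitlab_limiter = RateLimiter()
# usage: diff = await gitlab_limiter.run(self.gitlab.get_diff, project_id, mr_id)
```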
What’s Next
The AI agent space is evolving rapidly. I’m particularly excited about:
- Agent-to-agent communication (A2A protocol) - Enabling complex multi-agent workflows
- Improved tool ecosystems - MCP (Model Context Protocol) standardization
- Better reasoning models - GPT-4o, Claude Opus 4, and beyond
I’m working on a detailed guide covering multi-agent orchestration and the A2A protocol. Subscribe below to get notified when it’s published.
Key Takeaways
Building production AI agents requires:
- Robust architecture - Plan for failures, not just success paths
- Comprehensive observability - You can’t improve what you can’t measure
- Cost awareness - Token costs scale with usage; optimize early
- Iterative refinement - Your first prompt won’t be your last
The gap between a demo and production is larger than most anticipate. But with the right architectural patterns and operational discipline, AI agents can deliver tremendous value in enterprise environments.
What challenges have you faced building AI agents? I’d love to hear about your experiences. Connect with me on LinkedIn or reach out directly.