Production Readiness
This page documents practical production-hardening guidance for the current framework release.
Scope
ai-agent-framework is a library, not a hosted runtime. Production readiness mostly depends on how you run it inside your application.
1. Error Boundaries
Handle framework errors at your app boundary and map them to safe user responses.
Common errors:
- ModelError: provider call failed
- ToolNotFoundError: model requested unknown tool
- ToolValidationError: tool args did not match zod schema
- MaxStepsExceededError: agent loop exhausted step budget
- PromptTemplateError: missing prompt variables
- OutputParserError: invalid parser input (for example non-JSON in JSON parser)
Recommended pattern:
- catch framework errors in one place
- return sanitized responses to users
- log structured details internally with request IDs
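The pattern above can be sketched as a single mapping function at the app boundary. This is a minimal sketch, not a framework API: `toSafeResponse`, `SafeResponse`, the status codes, and the reply texts are hypothetical, and matching on `error.name` assumes the framework's error classes set `name` to their class name (use `instanceof` instead if you import the classes directly).

```typescript
interface SafeResponse {
  status: number;
  message: string;
  requestId: string;
}

// Hypothetical mapping from framework error name to a user-safe reply.
// The error names come from the list of common errors; statuses and texts are illustrative.
const SAFE_REPLIES = new Map<string, { status: number; message: string }>([
  ["ModelError", { status: 502, message: "The model service is temporarily unavailable." }],
  ["ToolNotFoundError", { status: 500, message: "The request could not be completed." }],
  ["ToolValidationError", { status: 500, message: "The request could not be completed." }],
  ["MaxStepsExceededError", { status: 504, message: "The request took too many steps to finish." }],
  ["PromptTemplateError", { status: 500, message: "The request could not be completed." }],
  ["OutputParserError", { status: 502, message: "The model returned an unusable answer." }],
]);

// Single place where thrown values become user-facing responses.
function toSafeResponse(err: unknown, requestId: string): SafeResponse {
  const name = err instanceof Error ? err.name : "UnknownError";
  const detail = err instanceof Error ? err.message : String(err);

  // Full detail goes to internal logs only, keyed by request ID.
  console.error(JSON.stringify({ level: "error", requestId, name, detail }));

  // Unknown errors fall back to a generic reply; the raw message is never returned.
  const reply = SAFE_REPLIES.get(name) ?? { status: 500, message: "Something went wrong." };
  return { status: reply.status, message: reply.message, requestId };
}
```

Call it from the one `catch` block that wraps your agent or chain invocation, so every failure path produces the same response shape.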
2. Timeouts And Retries
The framework does not apply provider timeouts or retries for you.
Recommended:
- apply network timeout at provider client layer
- retry only transient failures (rate limits, timeouts, transport errors)
- cap retries and use jittered backoff
- avoid retrying deterministic validation failures
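A minimal sketch of these recommendations follows, assuming you wrap the provider call yourself. `withRetry`, `backoffDelayMs`, and `isTransientHttpError` are hypothetical helpers, and the `status` field on thrown errors is an assumption about your provider client. The network timeout itself is left to the client layer and is not shown.

```typescript
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first call
  baseDelayMs: number;
  maxDelayMs: number;
  isTransient: (err: unknown) => boolean;
}

// Full-jitter backoff: a random delay in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Hypothetical classifier: treats HTTP 429 and 5xx statuses on the error as transient.
function isTransientHttpError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const status = (err as { status?: unknown }).status;
  return typeof status === "number" && (status === 429 || status >= 500);
}

// Retries `call` on transient failures only, up to `maxAttempts` attempts.
async function withRetry<T>(call: () => Promise<T>, opts: RetryOptions): Promise<T> {
  if (opts.maxAttempts < 1) throw new Error("maxAttempts must be at least 1");
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      const isLastAttempt = attempt === opts.maxAttempts - 1;
      // Deterministic failures (for example validation errors) are rethrown immediately.
      if (isLastAttempt || !opts.isTransient(err)) throw err;
      const delay = backoffDelayMs(attempt, opts.baseDelayMs, opts.maxDelayMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("unreachable: the loop always returns or throws");
}
```

Passing `random` as a parameter keeps the delay calculation deterministic in tests.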
3. Agent Guardrails
For tool-using agents:
- keep maxSteps conservative (start around 6-12)
- keep tool schemas strict and explicit
- keep tool side effects idempotent when possible
- require confirmation for destructive operations at app layer
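One way to require confirmation at the app layer is to wrap the destructive tool handler before registering it. The sketch below is illustrative only: `withConfirmation`, `ConfirmationRequired`, and the `deleteProject` example are hypothetical names, and how your app records a confirmation is up to you.

```typescript
// Hypothetical result returned instead of running an unconfirmed destructive action.
interface ConfirmationRequired {
  status: "confirmation_required";
  summary: string;
}

// Wraps a handler so it only runs once the app has recorded user confirmation.
// `describe` and `isConfirmed` are app-supplied callbacks, not framework APIs.
function withConfirmation<A, R>(
  handler: (args: A) => R,
  describe: (args: A) => string,
  isConfirmed: (args: A) => boolean,
): (args: A) => R | ConfirmationRequired {
  return (args: A) => {
    if (!isConfirmed(args)) {
      return { status: "confirmation_required", summary: describe(args) };
    }
    return handler(args);
  };
}

// Usage: the app adds an ID to this set only after the user explicitly confirms.
const confirmedProjectIds = new Set<string>();

const deleteProject = withConfirmation(
  (args: { projectId: string }) => `deleted ${args.projectId}`,
  (args) => `Delete project ${args.projectId}?`,
  (args) => confirmedProjectIds.has(args.projectId),
);
```

Because the gate lives outside the model loop, a model that repeatedly requests the tool still cannot trigger the side effect without the user's confirmation.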
4. Observability Baseline
Minimum signals to capture:
- request ID / trace ID
- prompt + tool execution latency
- model/token usage from provider responses (if available)
- tool call counts and failure rates
- parser failure rates
- max-step exhaustion count
Use hooks for lifecycle instrumentation:
- hooks.onStart(state)
- hooks.onEnd(state, result)
Runtime spans are also available on state for step-level timings.
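As a rough illustration of hook-based instrumentation, the sketch below builds a pair of hooks that measure end-to-end run latency for one request. Only the `onStart(state)` and `onEnd(state, result)` signatures come from this page; `LifecycleHooks`, `RunMetric`, and `createLatencyHooks` are hypothetical, and how hooks are registered with an agent or chain is not shown.

```typescript
// Assumed hook shape, based on the two signatures listed above.
interface LifecycleHooks<S, R> {
  onStart(state: S): void;
  onEnd(state: S, result: R): void;
}

// Hypothetical metric record; extend it with token usage or tool counts as needed.
interface RunMetric {
  requestId: string;
  durationMs: number;
}

// Creates hooks for a single run. The start time lives in a closure, so the
// sketch makes no assumption about the contents or identity of `state`.
function createLatencyHooks<S, R>(
  requestId: string,
  emit: (metric: RunMetric) => void,
  now: () => number = Date.now,
): LifecycleHooks<S, R> {
  let startedAt: number | undefined;
  return {
    onStart() {
      startedAt = now();
    },
    onEnd() {
      if (startedAt === undefined) return; // onEnd without onStart: nothing to report
      emit({ requestId, durationMs: now() - startedAt });
    },
  };
}
```

`emit` can forward to whatever metrics client your application already uses; injecting `now` keeps the timing testable.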
5. Prompt And Output Safety
- enforce strict output contracts with JsonOutputParser where possible
- validate downstream business constraints after parsing
- version prompts intentionally; treat prompt changes like code changes
- never trust model output directly for privileged actions
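To illustrate the difference between parsing and business validation, the sketch below takes an already-parsed value (for example, the result of a JSON parser) and checks it against application rules before anything privileged happens. `RefundDecision`, `checkRefund`, and the refund rules are hypothetical examples, not part of the framework.

```typescript
// Hypothetical structured output the model is asked to produce.
interface RefundDecision {
  orderId: string;
  amountCents: number;
}

// Shape check: the parsed value has the fields and types we expect.
function isRefundDecision(value: unknown): value is RefundDecision {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record.orderId === "string" && typeof record.amountCents === "number";
}

// Business check: well-formed output can still be unacceptable.
function checkRefund(parsed: unknown, orderTotalCents: number): RefundDecision {
  if (!isRefundDecision(parsed)) {
    throw new Error("output does not match the RefundDecision shape");
  }
  if (!Number.isInteger(parsed.amountCents) || parsed.amountCents <= 0) {
    throw new Error("refund amount must be a positive whole number of cents");
  }
  if (parsed.amountCents > orderTotalCents) {
    throw new Error("refund amount exceeds the order total");
  }
  return parsed;
}
```

Only a value that passes `checkRefund` should reach the code that actually issues the refund.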
6. Tool Safety
- least-privilege tool design
- authz checks inside tool handlers
- redact secrets in tool outputs before storing in memory/logs
- rate-limit expensive or external tools
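Redaction can be sketched as a small recursive pass over tool output before it is written to memory or logs. The key pattern below is an assumption and will need tuning for your data; the function handles plain JSON-like values only and does not catch secrets embedded inside free-text strings.

```typescript
// Hypothetical pattern for key names that usually hold secrets.
const SENSITIVE_KEY = /(api[_-]?key|token|secret|password|authorization)/i;

// Returns a copy of `value` with sensitive fields replaced by a placeholder.
// Intended for plain JSON-like data (objects, arrays, primitives).
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}
```

Applying `redact` at the point where tool results are stored means individual tool authors do not each have to remember it.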
7. Configuration Hygiene
- keep API keys in environment/secret manager, not code
- separate staging and production model configs
- pin model names intentionally and review changes before upgrades
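A minimal sketch of these points is a loader that reads configuration from the environment and fails fast when something is missing. The variable names `APP_ENV`, `MODEL_API_KEY`, and `MODEL_NAME` are hypothetical; substitute whatever your deployment uses.

```typescript
interface ModelConfig {
  environment: "staging" | "production";
  apiKey: string;
  model: string;
}

// Reads config from an environment-like object so it can be tested without
// touching the real process environment.
function loadModelConfig(env: Record<string, string | undefined>): ModelConfig {
  const environment = env.APP_ENV;
  if (environment !== "staging" && environment !== "production") {
    throw new Error("APP_ENV must be 'staging' or 'production'");
  }
  const apiKey = env.MODEL_API_KEY;
  if (!apiKey) {
    throw new Error("MODEL_API_KEY is not set");
  }
  // The model name is required rather than defaulted, so upgrades are explicit.
  const model = env.MODEL_NAME;
  if (!model) {
    throw new Error("MODEL_NAME is not set");
  }
  return { environment, apiKey, model };
}
```

In a Node.js application this would typically be called once at startup as `loadModelConfig(process.env)`.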
8. Testing Strategy
Layered testing:
- unit test runnables, parsers, and tools in isolation
- integration test chain/agent orchestration with provider mocks
- golden tests for stable prompt/output contracts
- failure-path tests: tool validation errors, missing tools, max-step exhaustion
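For the provider-mock layer, one simple option is a scripted fake that returns canned responses in order and records what it was asked. The `ModelProvider` interface here is a hypothetical stand-in; adapt it to whatever shape your provider adapter actually has.

```typescript
// Hypothetical provider shape used only for this sketch.
interface ModelProvider {
  complete(prompt: string): Promise<string>;
}

interface ScriptedProvider extends ModelProvider {
  calls: string[]; // prompts received, in order
}

// Returns responses in the given order; an Error entry simulates a failed call.
function scriptedProvider(responses: Array<string | Error>): ScriptedProvider {
  const queue = responses.slice();
  const calls: string[] = [];
  return {
    calls,
    complete(prompt: string): Promise<string> {
      calls.push(prompt);
      const next = queue.shift();
      if (next === undefined) {
        return Promise.reject(new Error("scripted provider ran out of responses"));
      }
      return next instanceof Error ? Promise.reject(next) : Promise.resolve(next);
    },
  };
}
```

Scripting an `Error` as one of the responses gives a deterministic way to exercise failure paths, and the `calls` array lets golden tests assert on the exact prompts sent.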
9. Recommended Rollout Path
- ship chain-based workflows first
- add tools behind feature flags
- enable agent loops for a narrow user segment
- monitor latency, failure rates, and max-step exhaustions
- expand traffic only after error rates have stayed within your error budget
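Enabling a feature for a narrow user segment can be sketched with deterministic percentage bucketing, so a given user stays in or out of the rollout between requests. This is an illustrative helper rather than a framework feature; `inRollout`, the feature name, and the hashing choice (an FNV-1a style hash over UTF-16 code units) are assumptions.

```typescript
// FNV-1a style 32-bit hash, used only to spread users evenly across buckets.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// True when the user falls inside the first `percent` of 100 buckets.
// Including the feature name keeps different rollouts statistically independent.
function inRollout(userId: string, feature: string, percent: number): boolean {
  const bucket = fnv1a(`${feature}:${userId}`) % 100;
  return bucket < percent;
}
```

Raising `percent` only ever adds users to the segment, so widening the rollout does not flip anyone who already had the feature.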
Current Limits
The current release does not include:
- built-in persistence or distributed queue execution
- built-in auth/authz layer
- built-in policy engine for tool permissions
- built-in metrics export pipeline
Treat these as application responsibilities in the current release.