Production Readiness

This page documents practical production-hardening guidance for the current framework release.

Scope

ai-agent-framework is a library, not a hosted runtime. Production readiness mostly depends on how you run it inside your application.

1. Error Boundaries

Handle framework errors at your app boundary and map them to safe user responses.

Common errors:

ModelError: provider call failed
ToolNotFoundError: model requested unknown tool
ToolValidationError: tool arguments did not match the tool's zod schema
MaxStepsExceededError: agent loop exhausted step budget
PromptTemplateError: missing prompt variables
OutputParserError: parser received invalid input (for example, non-JSON text passed to the JSON parser)

Recommended pattern:

catch framework errors in one place
return sanitized responses to users
log structured details internally with request IDs
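
The pattern above can be sketched roughly as follows. The error classes are declared locally as stand-ins so the example is self-contained; in an application they would come from the framework. The helper name `toSafeResponse`, the `SafeResponse` shape, and the status codes and messages are all illustrative assumptions, not part of the framework.

```typescript
// Local stand-ins for the framework's error classes (assumption: in real code
// these are imported from the framework rather than redeclared).
class ModelError extends Error {}
class ToolNotFoundError extends Error {}
class ToolValidationError extends Error {}
class MaxStepsExceededError extends Error {}
class PromptTemplateError extends Error {}
class OutputParserError extends Error {}

// Hypothetical response shape returned to the caller.
interface SafeResponse {
  status: number;
  message: string;
  requestId: string;
}

// Hypothetical boundary helper: logs full details internally and returns
// only a sanitized message. Status codes here are illustrative choices.
function toSafeResponse(err: unknown, requestId: string): SafeResponse {
  console.error(
    JSON.stringify({
      requestId,
      errorType: err instanceof Error ? err.constructor.name : typeof err,
      detail: err instanceof Error ? err.message : String(err),
    }),
  );

  if (err instanceof ModelError) {
    return { status: 503, message: "The assistant is temporarily unavailable.", requestId };
  }
  if (err instanceof MaxStepsExceededError) {
    return { status: 504, message: "The request took too many steps to complete.", requestId };
  }
  if (
    err instanceof ToolNotFoundError ||
    err instanceof ToolValidationError ||
    err instanceof OutputParserError
  ) {
    return { status: 502, message: "The assistant produced an invalid response.", requestId };
  }
  // PromptTemplateError and anything unrecognized are treated as internal errors.
  return { status: 500, message: "Something went wrong.", requestId };
}
```

Calling this helper from a single place (for example, one HTTP error handler) keeps raw error messages, which may contain tool arguments or prompt text, out of user-facing responses.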

2. Timeouts And Retries

The framework does not enforce provider timeouts or retries; your application has to configure them.

Recommended:

apply network timeout at provider client layer
retry only transient failures (rate limits, timeouts, transport errors)
cap retries and use jittered backoff
avoid retrying deterministic validation failures
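
One way to apply the retry recommendations is a small wrapper around the provider call, sketched below. The names `retryTransient` and `RetryOptions` are hypothetical, and deciding which errors count as transient is left to a caller-supplied `isTransient` function because that depends on your provider client. The sketch assumes the network timeout is already configured on the provider client, so a timed-out call surfaces as a thrown error.

```typescript
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first call
  baseDelayMs: number; // backoff ceiling before the first retry
  maxDelayMs: number; // cap on the backoff ceiling
  isTransient: (err: unknown) => boolean; // true for rate limits, timeouts, transport errors
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retries only failures the caller classifies as transient.
async function retryTransient<T>(
  attempt: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  for (let n = 1; ; n++) {
    try {
      return await attempt();
    } catch (err) {
      // Deterministic failures (for example validation errors) are rethrown
      // immediately, as is the last failure once the attempt cap is reached.
      if (!opts.isTransient(err) || n >= opts.maxAttempts) throw err;
      // "Full jitter": wait a random time below an exponentially growing,
      // capped ceiling so that concurrent clients do not retry in lockstep.
      const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** (n - 1));
      await sleep(Math.random() * ceiling);
    }
  }
}
```

A typical `isTransient` would check for the provider client's rate-limit, timeout, and connection error types and return false for everything else, including ToolValidationError and OutputParserError.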

3. Agent Guardrails

For tool-using agents:

keep maxSteps conservative (start with a limit of roughly 6 to 12 steps)
keep tool schemas strict and explicit
keep tool side effects idempotent when possible
require confirmation for destructive operations at app layer
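
The last two points can be combined in an application-level wrapper around a tool handler, sketched below. Everything here is hypothetical: `guardTool`, `GuardOptions`, and `ConfirmationRequiredError` are not framework APIs, and the handler type is a simplification of whatever signature your tools actually use. The in-memory map only deduplicates within one process; a multi-instance deployment would need a shared store.

```typescript
type ToolHandler<A, R> = (args: A) => Promise<R>;

interface GuardOptions<A> {
  destructive: boolean;
  // Derives a stable key so repeated calls with the same intent run once.
  idempotencyKey: (args: A) => string;
  // Application-layer check, e.g. whether the user approved this action.
  isConfirmed: (args: A) => Promise<boolean>;
}

class ConfirmationRequiredError extends Error {}

// Hypothetical wrapper: blocks unconfirmed destructive calls and reuses the
// result of an earlier call that had the same idempotency key.
function guardTool<A, R>(
  handler: ToolHandler<A, R>,
  opts: GuardOptions<A>,
): ToolHandler<A, R> {
  const runs = new Map<string, Promise<R>>();
  return async (args) => {
    if (opts.destructive && !(await opts.isConfirmed(args))) {
      throw new ConfirmationRequiredError("confirmation required before running this tool");
    }
    const key = opts.idempotencyKey(args);
    const existing = runs.get(key);
    if (existing) return existing;
    const run = handler(args);
    runs.set(key, run);
    try {
      return await run;
    } catch (err) {
      runs.delete(key); // a failed run should not block a later attempt
      throw err;
    }
  };
}
```

With this shape, an agent that calls the same destructive tool twice within one loop triggers the side effect once, and a call the user has not approved never reaches the handler.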

4. Observability Baseline

Minimum signals to capture:

request ID / trace ID
prompt and tool execution latency
model and token usage from provider responses (if available)
tool call counts and failure rates
parser failure rates
max-step exhaustion count

Use hooks for lifecycle instrumentation:

hooks.onStart(state)
hooks.onEnd(state, result)

Runtime spans are also available on state for step-level timings.
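
As a rough illustration, the two hooks can be used to record end-to-end run latency. Only the call shapes `onStart(state)` and `onEnd(state, result)` come from this page; the `Hooks` interface, the `createTimingHooks` factory, and the `RunMetric` shape below are assumptions, as is the premise that the same `state` object is passed to both hooks for a given run.

```typescript
// Assumed hook shape, based only on the two calls named above.
interface Hooks<S extends object, R> {
  onStart(state: S): void;
  onEnd(state: S, result: R): void;
}

// Hypothetical metric record; adapt to your metrics client.
interface RunMetric {
  requestId: string;
  durationMs: number;
}

// Hypothetical factory: measures the time between onStart and onEnd without
// reading any fields from the state object itself.
function createTimingHooks<S extends object, R>(
  requestId: string,
  emit: (metric: RunMetric) => void,
  now: () => number = Date.now,
): Hooks<S, R> {
  // WeakMap keyed by the state object, so entries do not outlive the run.
  const startedAt = new WeakMap<S, number>();
  return {
    onStart(state) {
      startedAt.set(state, now());
    },
    onEnd(state, _result) {
      const start = startedAt.get(state);
      if (start === undefined) return; // onEnd without a matching onStart
      startedAt.delete(state);
      emit({ requestId, durationMs: now() - start });
    },
  };
}
```

The `emit` callback is where a metrics or tracing client would be called; injecting `now` keeps the hooks testable with a fake clock.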

5. Prompt And Output Safety

enforce strict output contracts with JsonOutputParser where possible
validate downstream business constraints after parsing
version prompts intentionally; treat prompt changes like code changes
never trust model output directly for privileged actions
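
The difference between parsing and business validation can be sketched with a hypothetical refund workflow. Plain `JSON.parse` plus manual checks stands in for the framework's parser here, and `RefundDecision`, `parseRefundDecision`, and `assertRefundAllowed` are invented names for illustration only.

```typescript
// Hypothetical structured output the model is asked to produce.
interface RefundDecision {
  orderId: string;
  amountCents: number;
}

// Step 1: structural contract. The output must be JSON with the expected shape.
function parseRefundDecision(raw: string): RefundDecision {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    throw new Error("model output is not valid JSON");
  }
  if (typeof value !== "object" || value === null) {
    throw new Error("expected a JSON object");
  }
  const { orderId, amountCents } = value as Record<string, unknown>;
  if (
    typeof orderId !== "string" ||
    typeof amountCents !== "number" ||
    !Number.isInteger(amountCents)
  ) {
    throw new Error("model output does not match the expected shape");
  }
  return { orderId, amountCents };
}

// Step 2: business constraints. Well-formed output can still be unacceptable.
function assertRefundAllowed(
  decision: RefundDecision,
  orderTotalCents: number,
  maxAutoRefundCents: number,
): void {
  if (decision.amountCents <= 0) {
    throw new Error("refund amount must be positive");
  }
  if (decision.amountCents > orderTotalCents) {
    throw new Error("refund exceeds the order total");
  }
  if (decision.amountCents > maxAutoRefundCents) {
    throw new Error("refund requires manual approval");
  }
}
```

Only after both steps pass would the application perform the privileged action, using values it has checked rather than the raw model text.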

6. Tool Safety

design tools with least privilege
perform authorization checks inside tool handlers
redact secrets in tool outputs before storing in memory/logs
rate-limit expensive or external tools
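
Redaction can be as simple as a recursive pass over tool output before it is logged or stored. The sketch below is a minimal, key-name-based approach; `redactSecrets` and the key pattern are assumptions to adapt, and the function is meant for plain JSON-like data only. It does not detect secrets embedded inside string values.

```typescript
// Hypothetical list of key names treated as sensitive; extend for your tools.
const SECRET_KEY_PATTERN = /(api[_-]?key|token|secret|password|authorization)/i;

// Returns a copy of the value with sensitive-looking fields masked.
// Intended for plain JSON-like data (objects, arrays, primitives).
function redactSecrets(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactSecrets(item));
  }
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SECRET_KEY_PATTERN.test(key) ? "[REDACTED]" : redactSecrets(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

Applying this at the point where tool results enter memory or logs, rather than inside each tool, gives one place to audit.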

7. Configuration Hygiene

keep API keys in environment/secret manager, not code
separate staging and production model configs
pin model names intentionally and review changes before upgrades
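
A minimal sketch of loading this configuration at startup follows. The variable names `APP_ENV`, `MODEL_NAME`, and `MODEL_API_KEY` and the `ModelConfig` shape are hypothetical; the point is that the process fails fast when a value is missing instead of falling back to a default model or key.

```typescript
// Hypothetical configuration shape for one deployment environment.
interface ModelConfig {
  environment: "staging" | "production";
  model: string; // pinned model name, changed only through review
  apiKey: string; // injected by the environment or a secret manager
}

// Reads configuration from an environment-like map and fails fast on gaps.
function loadModelConfig(env: Record<string, string | undefined>): ModelConfig {
  const environment = env["APP_ENV"];
  if (environment !== "staging" && environment !== "production") {
    throw new Error('APP_ENV must be "staging" or "production"');
  }
  const model = env["MODEL_NAME"];
  if (!model) {
    throw new Error("MODEL_NAME is required; pin it explicitly per environment");
  }
  const apiKey = env["MODEL_API_KEY"];
  if (!apiKey) {
    throw new Error("MODEL_API_KEY is required");
  }
  return { environment, model, apiKey };
}
```

In a Node.js application this would typically be called once with `process.env`, and the resulting object passed to wherever the provider client is constructed.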

8. Testing Strategy

Layered testing:

unit test runnables, parsers, and tools in isolation
integration test chain/agent orchestration with provider mocks
golden tests for stable prompt/output contracts
failure-path tests: tool validation errors, missing tools, max-step exhaustion

9. Recommended Rollout Path

ship chain-based workflows first
add tools behind feature flags
enable agent loops for a narrow user segment
monitor latency, failure rates, and max-step exhaustions
expand traffic only after error rates have stayed within your error budget
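
Enabling a feature for a narrow user segment is often done with deterministic percentage bucketing, sketched below. This is one common technique rather than anything the framework provides: `isInRollout` and `hashString` are hypothetical helpers, and the hash is an FNV-1a-style function over the string's UTF-16 code units, chosen only because it is short and stable.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units. Not cryptographic; used
// only to spread user IDs across buckets in a stable way.
function hashString(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// Hypothetical rollout check: the same user always lands in the same bucket
// for a given feature, so raising `percent` only ever adds users.
function isInRollout(userId: string, feature: string, percent: number): boolean {
  const bucket = hashString(`${feature}:${userId}`) % 100; // 0..99
  return bucket < percent;
}
```

Including the feature name in the hashed string keeps the segments for different features independent, so the same users are not always the first to receive every change.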

Current Limits

The current release does not include:

built-in persistence or distributed queue execution
built-in auth/authz layer
built-in policy engine for tool permissions
built-in metrics export pipeline

Treat these as application responsibilities in the current release.
