Building reliable AI agents that use Grand Central’s MCP tools requires thoughtful design around tool selection, error handling, performance, and security. This page captures patterns from production deployments - what works, what fails, and how to avoid common pitfalls.

Tool selection

Enable only the tools your agent genuinely needs. Start with 2 to 3 essential operations (e.g., getCustomerProfile, searchKnowledgeBase) and expand as use cases evolve. Automated validation runs faster for focused tool sets, so enabling 20 tools upfront means longer validation time. You can always enable additional tools later through the admin portal as your agent’s capabilities grow.

Match tools to your agent’s specific purpose. A customer support agent needs account lookup and order history - it doesn’t need payment processing or admin operations. A lending advisor needs credit checks and loan calculations - it doesn’t need transaction search. Narrow tool scopes accelerate automated validation and minimize the blast radius if credentials leak.

Avoid redundancy. If two backend APIs provide similar functionality (e.g., GET /customers/{id} and GET /users/{id} both return customer data), enable only one through the portal. Multiple overlapping tools confuse agents during tool selection and waste rate limit capacity on duplicate functionality.

Configure minimal permissions for each tool. Read-only access passes automated validation faster than write access. User-scoped tools (limited to the authenticated user’s data) are safer than admin tools (with access to all customers). Be explicit when configuring tool access: “This agent needs read-only, user-scoped access to customer profiles for personalization.”

Enabling new tools

When enabling MCP exposure for an API operation through the admin portal, provide clear configuration details that help automated validation understand the security implications. Well-configured tools pass validation in 1 to 3 days for low-risk operations; incomplete configurations may require manual review extending to 3 days for high-risk operations. Good configuration example:
Operation: GET /api/customers/{id}
Use Case: Customer support agent needs to lookup customer details to personalize 
          responses when helping with account issues. Agent will use name, email, 
          and account status to confirm identity and tailor advice.
Expected Volume: ~200 calls/day during business hours (9am to 5pm EST)
Data Sensitivity: PII (names, emails, phone numbers, account status)
Required Permissions: Read-only, user-scoped (agent can only access data for the 
                      authenticated customer they're assisting)
User Authentication: JWT token with customerId claim, issued by our auth service
This configuration provides the context automated validation needs: Why enable this tool? How much load will it create? What sensitive data is exposed? What controls limit access?

Poor configuration example:
Operation: GET /api/customers/{id}
Use Case: Need customer data for agent
This lacks the details automated validation needs. Your configuration may be flagged for manual review, extending validation time.

Specify permission levels explicitly. Read-only tools pass automated validation faster than write operations. User-scoped tools (limited to the authenticated user’s data) are safer than admin tools (which can access all customers). Public data tools (currency rates, help articles) need minimal configuration. Admin operations (delete accounts, override limits) require strong business justification and typically trigger manual review.
The more permissive the tool, the longer validation takes. Configure minimal necessary access to speed up validation. You can always adjust permissions later through the portal if requirements change.

AI agent design

Well-designed agents follow predictable patterns that improve reliability and user experience. Here’s the workflow that works in production:

Call tools/list once at startup to discover available tools. Cache the result for the agent’s lifetime - tool definitions don’t change mid-session. This avoids wasting rate limit quota on redundant discovery calls. During startup, parse tool descriptions and parameter schemas so the agent understands what each tool does and what inputs it requires.

Match tool descriptions to user intent. When a user asks “What’s my account balance?”, the agent should recognize that getAccountBalance (description: “Retrieve current balance for a bank account”) is the right tool. Train your agent to read tool metadata and choose appropriately. Generic instructions like “You have access to tools, use them when needed” lead to agents that guess randomly or ask users which tool to invoke.

Validate parameters before invocation. Check required fields, data types, and format patterns from the tool’s inputSchema before calling tools/call. If the schema requires accountId matching pattern ^ACC-[0-9]{6}$ and the user provides “12345”, catch that error in agent logic rather than firing a doomed API call. Better user experience: “That doesn’t look like a valid account ID - they usually start with ACC- followed by 6 digits.”

Handle tool failures gracefully. Don’t let 401 errors or rate limits crash the conversation. Implement fallback behavior: retry with exponential backoff for transient errors, ask users to re-authenticate for auth failures, and apologize and offer human escalation for persistent problems. Never fabricate data if a tool call fails - tell users the truth about what went wrong.
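Here’s a minimal sketch of that pre-invocation check, assuming the tool’s inputSchema was captured during tools/list at startup; the helper name and schema handling are illustrative, not a specific client API:
import re

def validate_arguments(input_schema, arguments):
    """Return a list of user-facing problems; empty if the arguments look valid."""
    problems = []
    for field in input_schema.get("required", []):
        if field not in arguments:
            problems.append(f"Missing required field: {field}")
    for field, spec in input_schema.get("properties", {}).items():
        value = arguments.get(field)
        if value is None:
            continue
        pattern = spec.get("pattern")
        if pattern and not re.fullmatch(pattern, str(value)):
            problems.append(f"{field} doesn't match the expected format ({pattern})")
    return problems

# Example: catch a malformed account ID before wasting an API call
schema = {
    "required": ["accountId"],
    "properties": {"accountId": {"type": "string", "pattern": "^ACC-[0-9]{6}$"}},
}
if validate_arguments(schema, {"accountId": "12345"}):
    print("That doesn't look like a valid account ID - they usually start with ACC- followed by 6 digits.")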

Prompt engineering for tools

Effective prompts explicitly connect tools to use cases so agents know when to invoke each one:
You are a customer support assistant for JetBank with access to these tools:

Tools:
- getCustomerProfile(customerId): Retrieve customer name, email, account status. 
  USE WHEN: User asks about their account, you need to personalize responses.
  REQUIRES: User must be authenticated (JWT token with customerId claim).

- searchKnowledgeBase(query): Find help articles about products and policies.
  USE WHEN: User asks product questions, policy questions, "how do I" questions.
  PUBLIC: No authentication needed, safe to call anytime.

- listTransactions(accountId, startDate, endDate): Get transaction history.
  USE WHEN: User asks "What did I spend on?", "Show my recent activity".
  REQUIRES: accountId format ACC-XXXXXX, authenticated user must own the account.

Guidelines:
1. Always verify customer identity before calling getCustomerProfile or listTransactions.
2. Try searchKnowledgeBase first for common questions ("What's your return policy?").
3. If a tool fails, apologize: "I'm having trouble accessing that right now. Let me connect you with a specialist."
4. NEVER fabricate customer data. If getCustomerProfile returns an error, say "I couldn't retrieve your account information" instead of guessing.
5. Cache tool results in conversation context - don't call getCustomerProfile multiple times in one conversation.
Compare that to a useless prompt:
You are a customer support assistant. You have access to tools. Use them when needed.
The detailed version gives the agent decision criteria (when to use each tool), authentication awareness (which tools need user context), and error handling patterns (what to do when tools fail).
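One way to keep that tool section accurate is to derive it from the cached tools/list result instead of maintaining it by hand. This is a sketch assuming each tool entry carries name, description, and inputSchema fields; PROMPT_HEADER and GUIDELINES are hypothetical strings you would maintain yourself:
def build_tool_prompt(tools):
    # Each tool entry is assumed to expose name, description, and inputSchema,
    # matching the shape returned by tools/list.
    lines = ["Tools:"]
    for tool in tools:
        params = ", ".join(tool["inputSchema"].get("properties", {}).keys())
        lines.append(f"- {tool['name']}({params}): {tool['description']}")
    return "\n".join(lines)

# tools = mcp_client.call("tools/list").tools            # cached once at startup
# system_prompt = PROMPT_HEADER + build_tool_prompt(tools) + GUIDELINES  # hypothetical strings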

Rate limit awareness

Agent design directly impacts how quickly you hit rate limits. Cache tool results in conversation context to avoid redundant calls - if you fetch a customer profile at the start of a conversation, store it and reference the cached data instead of calling getCustomerProfile five times. Batch operations when possible - if the backend offers a listAccountBalances (plural) endpoint that accepts multiple account IDs, use that instead of sequential getAccountBalance calls. Implement exponential backoff when you hit 429 errors: wait 1s, 2s, 4s, 8s between retries rather than hammering the API. Here’s what good caching looks like in practice:
class CustomerSupportAgent:
    def __init__(self, mcp_client):
        self.mcp_client = mcp_client  # MCP client used for tool calls
        self.conversation_cache = {}  # Reset per conversation
    
    def get_customer_profile(self, customer_id):
        # Check cache first
        if customer_id in self.conversation_cache:
            return self.conversation_cache[customer_id]
        
        # Cache miss - call tool and store result
        profile = self.mcp_client.call_tool(
            "getCustomerProfile", 
            {"customerId": customer_id}
        )
        self.conversation_cache[customer_id] = profile
        return profile
This pattern reduces API calls by ~70% in typical support conversations where agents reference customer data multiple times.
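And here is a minimal sketch of the exponential backoff described above, assuming your client library raises a RateLimitError (the exception name will vary) on 429 responses:
import time

def call_with_backoff(mcp_client, tool_name, arguments, max_retries=4):
    # Retry 429s with exponential backoff: wait 1s, 2s, 4s, 8s between attempts.
    delay = 1
    for attempt in range(max_retries):
        try:
            return mcp_client.call_tool(tool_name, arguments)
        except RateLimitError:  # exception class assumed to come from your MCP client library
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2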

Security best practices

Security failures with AI agents are often subtle - accidentally logging PII, invoking tools without proper user authentication, or leaking API keys in error messages. Follow these patterns to avoid common pitfalls.

Protect sensitive data

Never log tool responses without data governance approval. Tool responses may contain PII (names, emails, account numbers) or sensitive business data (transaction amounts, credit scores). Logging “customer profile lookup succeeded” is fine. Logging the actual profile JSON ({"name": "Jane Doe", "ssn": "123-45-6789", ...}) violates privacy policies in most jurisdictions. If you need audit logs, log request metadata (tool name, customer ID, timestamp, success/failure) rather than response payloads.

Validate user context before calling user-scoped tools. Don’t accept unauthenticated user input like “show me profile for customer 12345” and directly call getCustomerProfile(id=12345). The attacker just tricked your agent into exposing someone else’s data. Instead, extract the customer ID from the authenticated JWT token and use that: getCustomerProfile(id=jwt.claims.customerId). If users need to look up other customers (e.g., support agents helping end users), verify the support agent has appropriate permissions via role checks.

Remember that Grand Central logs all tool invocations for audit purposes. Your agent’s actions are traceable: who called what tool, when, with what parameters, and whether it succeeded. Don’t use tools for purposes outside their approved use case - invoking getCustomerProfile to scrape customer data for marketing analytics will get flagged in audit reviews. Stick to the use case you justified in your MCP access request.
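A short sketch of both patterns together - a user-scoped lookup driven by the JWT claim, plus metadata-only audit logging; decoding the token into jwt_claims is assumed to happen earlier in your auth middleware:
def get_profile_for_authenticated_user(mcp_client, jwt_claims, logger):
    # Derive the customer ID from the verified JWT claim, never from free-form user input.
    customer_id = jwt_claims["customerId"]
    try:
        profile = mcp_client.call_tool("getCustomerProfile", {"customerId": customer_id})
        # Audit-log metadata only - never the response payload, which may contain PII.
        logger.info("getCustomerProfile succeeded for customer %s", customer_id)
        return profile
    except Exception:
        logger.warning("getCustomerProfile failed for customer %s", customer_id)
        raise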

API key management

Store API keys in a secure secret management system (Azure Key Vault, AWS Secrets Manager, HashiCorp Vault) rather than in configuration files or version control - git history is permanent, and even private repositories protected by access controls get forked or made public accidentally. Rotate keys regularly (every 90 days, or immediately if compromise is suspected): request a new key, update the agent config, test with the new key, revoke the old key.

Use different API keys for dev/staging/production environments - if a dev key leaks, production isn’t compromised. Don’t share API keys between different applications - if one app is compromised, all apps using that key are exposed. Don’t log or print API keys in application output (error messages, debug logs, dashboards).

Monitor Grand Central’s usage dashboard for unexpected activity (midnight API calls when your agent should be idle, calls from unexpected IP ranges).
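A minimal sketch of keeping the key out of code, assuming your deployment injects it into the environment from your secret management system; the variable name and client constructor are illustrative:
import os

# Injected into the environment by your deployment from the secret management
# system; the variable name is an example, not a required convention.
GRAND_CENTRAL_API_KEY = os.environ["GRAND_CENTRAL_API_KEY"]

# Pass the key to your MCP client at startup rather than embedding it anywhere:
# mcp_client = MCPClient(endpoint=MCP_ENDPOINT, api_key=GRAND_CENTRAL_API_KEY)  # hypothetical constructor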

Error handling

Implement graceful degradation

Your AI agent should handle tool failures without breaking user experience:
# Assumes self.mcp_client, self.logger, and the exception classes below come
# from your MCP client library and application setup.
class CustomerSupportAgent:
    def get_customer_info(self, customer_id):
        try:
            return self.mcp_client.call_tool(
                "getCustomerProfile", 
                {"customerId": customer_id}
            )
        except RateLimitError:
            return "I'm experiencing high load. Please try again in a moment."
        except UnauthorizedError:
            return "I don't have permission to access that information."
        except ToolNotFoundError:
            return "That feature is temporarily unavailable."
        except Exception as e:
            # Log error internally
            self.logger.error(f"Tool call failed: {e}")
            # User-friendly message
            return "I encountered an error. Please contact human support."

Common error scenarios

| Error Type | User Experience | Agent Response |
| --- | --- | --- |
| Rate limit exceeded | Temporary delay | “I’m processing many requests. Please wait 30 seconds.” |
| Authentication failed | Configuration issue | Contact your administrator |
| Tool not found | Feature unavailable | “That feature is currently unavailable.” |
| Invalid parameters | Agent mistake | Retry with corrected parameters |
| Timeout | Backend slow/down | “This is taking longer than expected. Let me try again.” |

Performance optimization

Agent performance directly affects user satisfaction. Slow agents feel broken even if they’re technically working. Optimize aggressively:

Call tools/list once at startup and cache the result - don’t waste latency and rate limit quota discovering tools on every request. Tool definitions don’t change mid-session. Cache them for the agent’s lifetime (or until you detect a deployment that adds new tools).

Store tool results in conversation context to avoid redundant calls. If a user asks “What’s my name?” and you call getCustomerProfile, cache that result. When they later ask “What’s my email?”, reference the cached profile instead of calling the tool again. This pattern reduces API calls by 60-70% in typical conversations.

Use batch endpoints when processing collections. If you need balances for 5 accounts, check whether the backend offers listAccountBalances (plural), which accepts an array of IDs. One batched call (200ms) beats five sequential calls (5 x 150ms = 750ms). Not all tools support batching, but check during discovery - array-type parameters often indicate batch support.
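For illustration, here is what the sequential and batched versions look like side by side, assuming a plural listAccountBalances tool exists in your tool set:
account_ids = ["ACC-000001", "ACC-000002", "ACC-000003", "ACC-000004", "ACC-000005"]

# Slow: five sequential round trips (~150ms each)
balances = [
    mcp_client.call_tool("getAccountBalance", {"accountId": account_id})
    for account_id in account_ids
]

# Fast: one batched round trip (~200ms), if the plural tool is available
balances = mcp_client.call_tool("listAccountBalances", {"accountIds": account_ids})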

Response time expectations

Different operations have different performance profiles. Set user expectations appropriately:
  • Tool discovery (tools/list): under 500ms. Fast - should happen invisibly at startup.
  • Read operations (getCustomer, searchTransactions): 1 to 3 seconds. Moderate - users tolerate brief waits for data retrieval.
  • Write operations (createPayment, updateAccount): 2 to 5 seconds. Slower due to validation, database writes, and audit logging.
  • Complex operations (generateMonthlyReport, calculateCreditScore): 10 to 30+ seconds. Very slow - backend processing, aggregations, third-party API calls.
For slow operations, provide feedback so users don’t think the agent is frozen:
if tool_name == "generateMonthlyReport":
    print("Generating your monthly report - this typically takes 20-30 seconds...")
    result = mcp_client.call_tool("generateMonthlyReport", {"customerId": "CUST-789123"})
    print("Report ready!")
Without feedback, users assume failure after ~5 seconds and close the conversation window.

Testing your agent

Before deploying to production, test failure scenarios to verify your agent degrades gracefully. Production environments are hostile - rate limits trigger, authentication expires, backends time out. Agents that handle these conditions gracefully provide better user experiences than agents optimized only for the happy path.
  • Rate limit testing: Fire 150 requests rapidly to trigger 429 errors (typical limit: 100/minute). Does your agent implement exponential backoff? Does it inform users about delays? Or does it crash with an unhandled exception?
  • Authentication failure testing: Revoke your API key (or use an invalid key) and attempt tool invocations. Does your agent detect 401 errors and prompt users to re-authenticate? Or does it retry indefinitely, burning CPU and confusing users?
  • Tool unavailability testing: Simulate Grand Central platform outages by blocking network access to the MCP endpoint. Does your agent fall back to knowledge-only responses? Does it offer escalation to human support? Or does it show cryptic connection errors?
  • Invalid parameter testing: Call tools with malformed data (wrong data types, missing required fields, format pattern violations). Does your agent parse validation errors and ask users for corrected input? Or does it expose raw JSON-RPC error codes?
Here’s a smoke test script to run before every deployment:
def test_mcp_connection():
    """Run this before deploying agent to production.

    Assumes an `mcp_client` instance and the exception classes (RateLimitError,
    ValidationError) are provided by your MCP client library.
    """

    # 1. Tool discovery works
    try:
        tools = mcp_client.call("tools/list")
        print(f"✓ Tool discovery: {len(tools.tools)} tools available")
    except Exception as e:
        print(f"✗ Tool discovery failed: {e}")
        return False
    
    # 2. Tool invocation works (use a safe read-only operation)
    try:
        result = mcp_client.call_tool("getAccountBalance", {
            "accountId": "ACC-999999"  # Test account
        })
        print(f"✓ Tool invocation works")
    except Exception as e:
        print(f"✗ Tool invocation failed: {e}")
        return False
    
    # 3. Rate limit handling works
    print("Testing rate limit handling (this will trigger 429 errors)...")
    try:
        for i in range(150):  # Exceed typical 100/minute limit
            mcp_client.call("tools/list")
    except RateLimitError as e:
        print(f"✓ Rate limit handling works: {e}")
    except Exception as e:
        print(f"⚠ Unexpected error during rate limit test: {e}")
    
    # 4. Error handling works
    try:
        mcp_client.call_tool("getAccountBalance", {
            "accountId": "INVALID"  # Will trigger validation error
        })
    except ValidationError as e:
        print(f"✓ Validation error handling works")
    except Exception as e:
        print(f"⚠ Unexpected error type: {e}")
    
    return True
Run this script in your CI/CD pipeline before promoting builds to production.

Monitoring and observability

Production agents need observability to detect problems before users complain. Track metrics from both your agent’s perspective (client-side telemetry) and Grand Central’s dashboard (server-side platform metrics). Client-side metrics (instrument your agent code):
  • Tool call success rate: % of invocations that return results vs errors. Target: >95%.
  • Average response time: P50/P95/P99 latency for tool invocations. Watch for degradation trends.
  • Rate limit hit frequency: How often you hit 429 errors. If greater than 5%, adjust limits through admin portal.
  • Tool usage distribution: Which tools get invoked most frequently. Informs caching strategy.
  • Error type breakdown: Authentication (401), validation (-32602), backend (5xx). Helps prioritize fixes.
Server-side metrics (Grand Central dashboard):
  • Total tool invocations: Overall request volume, trends over time.
  • Rate limit consumption: % of limit used. Alert at 90% to avoid hitting hard stops.
  • Cost per tool: If your subscription has usage-based pricing.
  • Authentication failures: Spike indicates credential issues or attacks.
Dashboard access is automatically configured when you enable MCP through the admin portal. Dashboards show aggregated metrics across all agents using your subscription key.
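A minimal client-side instrumentation sketch that wraps tool calls to capture the success rate, latency, and error-type breakdown listed above; the in-memory counters are illustrative - wire them into whatever telemetry system you use:
import time
from collections import Counter

class InstrumentedMCPClient:
    def __init__(self, mcp_client):
        self.mcp_client = mcp_client
        self.latencies_ms = []   # feed into P50/P95/P99 calculations
        self.outcomes = Counter()  # per-tool success and error counts

    def call_tool(self, tool_name, arguments):
        start = time.monotonic()
        try:
            result = self.mcp_client.call_tool(tool_name, arguments)
            self.outcomes[f"{tool_name}:success"] += 1
            return result
        except Exception as e:
            self.outcomes[f"{tool_name}:{type(e).__name__}"] += 1
            raise
        finally:
            self.latencies_ms.append((time.monotonic() - start) * 1000)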

Alert configuration

Set up alerts that trigger before problems impact users.
Critical alerts (page on-call immediately):
  • Authentication failures greater than 5%: Indicates API key expired, revoked, or misconfigured. Check secret management system.
  • Tool error rate greater than 5%: Backend APIs are failing. Check Grand Central status page and system status in admin portal.
  • Rate limit hits greater than 10%: You’re hitting limits frequently. Implement better caching or adjust limits through admin portal.
Warning alerts (investigate during business hours):
  • P95 response time greater than 5s: Performance degrading. Check backend API status, review agent caching strategy.
  • Rate limit consumption greater than 90%: Approaching limit. Monitor closely and prepare to adjust through portal if sustained.

Production readiness checklist

Before deploying your agent to production, verify these requirements:
Security:
  • API keys stored in secret management system (Azure Key Vault, AWS Secrets Manager, not code)
  • Separate API keys for dev/staging/production environments
  • Rate limit handling implemented (exponential backoff, user feedback)
  • Authentication failure handling (prompt re-auth, don’t crash)
  • No sensitive data logged (log metadata, not response payloads)
Reliability:
  • Tool failure graceful degradation tested (what happens when tools fail?)
  • Retry logic with exponential backoff (1s, 2s, 4s, 8s delays)
  • Timeout handling for slow operations (>30s operations have user feedback)
  • Circuit breaker for repeated failures (stop hammering failing APIs - see the sketch after this checklist)
  • Fallback behavior documented (offer human escalation when tools unavailable)
Performance:
  • Tool discovery cached at startup (don’t call tools/list on every request)
  • Conversation context stores tool results (avoid redundant calls within conversation)
  • Redundant tool calls eliminated (profile caching, batch operations)
  • Response time expectations set for users (“Generating report, ~30s…”)
Monitoring:
  • Internal metrics collection enabled (success rate, latency, error types)
  • Access to Grand Central dashboard requested and granted
  • Alerts configured for authentication failures, rate limits, error rates
  • Runbook documented for common issues (see “Common Issues” section below)
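For the circuit-breaker item above, here is a minimal sketch; the failure threshold and cooldown values are illustrative, not platform requirements:
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=60):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_seconds:
            # Half-open: let one request through to probe the backend.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

# Usage: if breaker.allow_request(), call the tool and record_success()/record_failure();
# otherwise skip the call and fall back to human escalation.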

Common pitfalls

Overusing tools wastes rate limit quota and slows response times. Don’t call searchKnowledgeBase when the user asks “What’s your return policy?” - that’s static information the agent should know from training data. Save tool invocations for dynamic, user-specific data: “Have I returned anything this year?” requires getOrderHistory, but general policy questions don’t. Train agents to distinguish knowledge questions (answer immediately) from data retrieval questions (invoke tools).

Ignoring tool descriptions leads to agents using the wrong tools. If your agent doesn’t read the description for searchOrders (“Returns order history for the last 90 days only”), it might invoke that tool when users ask “Show me my orders from 2020” - which will return zero results even though the backend has older data available. Include tool descriptions in agent prompts and train the agent to match descriptions to user intent.

Exposing technical errors to users creates terrible UX. Never show messages like Error -32602: Invalid parameter 'customerId' must match regex ^[0-9]{5}$ to end users. Translate technical errors to plain language: “I couldn’t find that account. Please check the account number and try again.” Parse error codes in agent logic and provide context-appropriate responses.
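One simple way to do that translation is a mapping from error class to user-facing message; the exception names here follow the earlier examples and are assumptions about your client library:
USER_MESSAGES = {
    "RateLimitError": "I'm processing many requests. Please wait 30 seconds.",
    "UnauthorizedError": "I don't have permission to access that information.",
    "ValidationError": "I couldn't find that account. Please check the account number and try again.",
}

def user_facing_message(error):
    # Fall back to a generic apology for anything unrecognized.
    return USER_MESSAGES.get(type(error).__name__, "I encountered an error. Please contact human support.")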

Documentation

Maintain a runbook for your team (support engineers, on-call rotations) that documents tools, common issues, and escalation paths. Example structure:
# Customer support agent - MCP runbook

## Available tools
- getCustomerProfile(customerId): Retrieve customer name, email, account status. Contains PII - handle carefully.
- listTransactions(accountId, startDate, endDate): Transaction history. Requires user authentication.
- searchKnowledgeBase(query): Help articles. Safe, public data.

## Common issues

### "rate limit exceeded" (429 errors)
**Cause**: Agent exceeded 100 calls/minute quota
**Solution**: Wait 60 seconds for window to reset. If sustained, request quota increase via platform team.
**Prevention**: Implement caching (see "Performance" section above)

### "authentication failed" (401 errors)
**Cause**: API key expired, revoked, or misconfigured
**Solution**: Check secret management system for current key. Request new key from platform team if needed.
**Escalation**: If key is valid but errors persist, contact Grand Central support via platform team.

### Tool unavailable errors
**Cause**: Grand Central platform outage or maintenance
**Solution**: Check status page at status.grandcentral.example.com. Agents should fall back to knowledge-only responses.

## Escalation
- Platform Team: [email protected], Slack: #platform-support
- Grand Central Support: Contact via platform team (they have direct support channel)
Update this document when tools change, common issues evolve, or escalation contacts shift.

Next steps