Production agents require observability to detect issues early, optimize performance, and plan capacity. Grand Central provides dashboards and metrics that show how your agents use tools, where they hit limits, and what errors they encounter.

Available metrics

Grand Central tracks four categories of metrics across all agents using your subscription.

Request metrics show overall platform usage. Track total requests per day and month to understand growth trends. Monitor response time distributions (P50/P95/P99) to detect performance degradation. Watch success versus error rates to spot backend problems. Identify peak usage periods to plan capacity or adjust rate limits through the admin portal during high-traffic windows.

Tool usage metrics reveal which operations agents invoke most frequently. The top 10 tools list helps prioritize optimization - if getCustomerProfile accounts for 40% of requests, caching that data reduces request volume. Tool-specific error rates identify problematic integrations (for example, if searchTransactions fails 15% of the time, the backend API likely has issues). Usage patterns by client type show whether Claude agents behave differently than Copilot agents.

Rate limit metrics prevent surprises. Current usage as a percentage of your rate limit shows how close you are to the cap (greater than 90%: adjust limits in the portal soon). Rate limit hit frequency indicates whether agents need better caching (greater than 5% hit rate: implement conversation context caching). Subscription utilization trends help predict when you’ll need higher limits.

Authentication and security metrics detect credential problems and potential attacks. Failed authentication attempts spike when API keys expire or get revoked. API key usage patterns reveal which keys are active versus unused (rotate or revoke the unused ones). Security anomalies like midnight traffic spikes or requests from unexpected geographic regions warrant investigation. Credential rotation tracking reminds you when keys approach 90-day rotation deadlines.
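If you want to cross-check the dashboard’s response time distributions, the percentiles are easy to recompute from exported data. A minimal sketch in Python, assuming per-request latencies and status codes from an export or your own request logs; the field names are illustrative, not Grand Central’s export schema:

```python
# Sketch: recompute P50/P95/P99 and the error rate from exported request data.
# The "latency_ms" and "status" fields are assumptions - adjust them to
# whatever schema your export actually uses.
from statistics import quantiles

requests = [
    {"latency_ms": 210, "status": 200},
    {"latency_ms": 250, "status": 200},
    {"latency_ms": 1900, "status": 500},
    # ... remaining exported rows
]

latencies = [r["latency_ms"] for r in requests]
cuts = quantiles(latencies, n=100)              # 99 cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
error_rate = sum(r["status"] >= 400 for r in requests) / len(requests)

print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms  errors={error_rate:.1%}")
```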

Accessing your metrics

Grand Central provides a web-based monitoring dashboard with real-time and historical views. The dashboard shows current request rates (a live requests-per-minute graph), latency distributions (P50/P95/P99 over the last hour, day, or week), top tools by invocation count, and error breakdowns by tool and cause. Filters let you narrow metrics by API key (useful when multiple agents share a subscription), date range, or specific tools.

Dashboard access is configured through the admin portal. Users with portal access can view metrics for their subscriptions automatically. Grant additional team members access through the portal’s user management section by adding their email addresses or Entra groups. Typically, dashboard access goes to the developers building the agents, the DevOps teams managing deployments, and the support engineers troubleshooting issues.

API usage reports provide deeper analysis than the real-time dashboards. The system automatically generates monthly reports that include usage summaries (total requests, tool breakdown, success rates), cost analysis (if your subscription has usage-based billing), trend analysis (month-over-month growth, seasonal patterns), and optimization recommendations based on observed usage patterns. Download reports from the dashboard’s Reports section. These reports help with budget planning and identifying optimization opportunities.

Common monitoring patterns

Different teams monitor for different signals. Here’s what production teams watch:

Agent performance monitoring

Track which tools agents invoke most frequently to identify caching opportunities. If getCustomerProfile appears in the top 3 tools (by invocation count) but agents serve 100+ conversations per day, they’re probably calling it redundantly. Implement conversation context caching to reduce load.

Monitor tool-specific error rates to find integration problems early. If searchTransactions has a 10% error rate while other tools sit at less than 1%, the backend search API likely has issues. Escalate to the team that owns that API before users complain about broken functionality.

Check whether agents retry appropriately. Look for patterns like five consecutive calls to the same tool within 2 seconds - this indicates retry loops without backoff. Fix agent retry logic to use exponential delays (1s, 2s, 4s).

Validate latency against SLAs. If your UX requires under 3 seconds total response time and tool invocations average 2 seconds, you’re cutting it close. Identify slow tools and either optimize the backend or implement caching.
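The retry fix above is standard exponential backoff. A minimal sketch, assuming the tool call is wrapped in a generic callable; client.call in the usage comment is a hypothetical placeholder, not Grand Central’s SDK:

```python
# Sketch: retry a tool invocation with exponential backoff (1s, 2s, 4s)
# instead of hammering the endpoint in a tight loop.
import time
from typing import Any, Callable

class ToolError(Exception):
    """Raised when a tool call returns 429 or a 5xx (placeholder)."""

def invoke_with_backoff(invoke: Callable[[], Any], max_attempts: int = 4) -> Any:
    """Call `invoke`, sleeping 1s, 2s, 4s between failed attempts."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke()
        except ToolError:
            if attempt == max_attempts:
                raise                # out of attempts - surface the error
            time.sleep(delay)        # 1s, then 2s, then 4s
            delay *= 2

# Hypothetical usage:
# invoke_with_backoff(lambda: client.call("searchTransactions", query="refund"))
```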

Rate limit management

Watch daily and monthly request counts relative to your configured rate limits. If you’re consistently hitting 85 to 95% of your limits, adjust them in the admin portal before hitting hard stops. Plan for growth - if usage increases 20% month-over-month, you’ll need higher limits soon.

Monitor rate limit hit frequency (how often agents get 429 errors). If more than 5% of requests hit rate limits, agents are hammering the API inefficiently; implement caching, batch operations, or increase rate limits through the portal. If less than 1% do, your limits are appropriately sized.

Identify peak usage times. If agents serve customer support during business hours (9am to 5pm), expect peaks around 10am and 2pm when call volumes surge. Spread batch processing jobs (report generation, data exports) to off-peak windows (nights, weekends) to avoid competing for capacity during peak hours.

Alert on unexpected spikes. A 10x request surge at 2am suggests a problem: a runaway retry loop, a compromised API key being abused, or a misconfigured integration test hammering production. Investigate immediately.
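Both signals - 429 hit rate and limit utilization - can also be computed from an exported request log. A minimal sketch, assuming rows with HTTP status codes and a known daily limit; the field name and limit value are illustrative:

```python
# Sketch: measure rate-limit pressure from exported request data.
# The "status" field and DAILY_LIMIT value are assumptions, not a
# documented Grand Central schema.
DAILY_LIMIT = 50_000                  # your configured daily limit

requests = [
    {"status": 200},
    {"status": 429},
    {"status": 200},
    # ... remaining exported rows for the day
]

throttled = sum(r["status"] == 429 for r in requests)
hit_rate = throttled / len(requests)        # >5%: add caching or raise limits
utilization = len(requests) / DAILY_LIMIT   # >85-95%: adjust limits before hard stops

print(f"429 hit rate: {hit_rate:.1%}, daily utilization: {utilization:.1%}")
```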

Security monitoring

Failed authentication attempts should be rare (less than 0.5% of requests). A spike to 5 to 10% indicates API key expiration, rotation issues, or a potential credential stuffing attack. Rotate keys immediately if suspicious, and review access logs with the security team.

API key usage patterns reveal whether keys are appropriately scoped. If a development API key shows production-level traffic, developers are using the wrong credentials - enforce environment separation. If a key sits unused for 30 or more days, revoke it to reduce the attack surface.

Access patterns inconsistent with normal usage warrant investigation. Examples: requests from unexpected geographic regions (your agents run in US-East, but metrics show traffic from Asia), unusual time-of-day patterns (a customer support agent active at 3am), or suspicious tool combinations (why is a read-only support agent calling admin deletion tools?).

When security alerts trigger: rotate compromised API keys immediately through the admin portal (generate a new key, deploy it to agents, revoke the old key). Review access logs in the monitoring dashboard to understand the scope of a potential breach. Implement stricter access controls through the portal (add JWT authentication requirements, adjust tool access). Enable additional monitoring through the alerting configuration.
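A minimal sketch of the first two checks - failed-auth rate and stale keys - assuming an exported access log with a key identifier, status code, and timestamp per request; the fields are illustrative, not a documented Grand Central schema:

```python
# Sketch: flag two common credential problems from access-log data.
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
access_log = [
    {"api_key": "key-prod-1", "status": 200, "ts": now - timedelta(days=2)},
    {"api_key": "key-dev-3", "status": 401, "ts": now - timedelta(hours=1)},
    # ... remaining exported rows
]

# 1. Failed-auth rate: baseline should stay under 0.5%.
failed = sum(r["status"] in (401, 403) for r in access_log)
failed_rate = failed / len(access_log)
if failed_rate > 0.005:
    print(f"Auth failures at {failed_rate:.1%} - check for expired or revoked keys")

# 2. Stale keys: anything silent for 30+ days is a candidate for revocation.
last_seen = {}
for r in access_log:
    last_seen[r["api_key"]] = max(last_seen.get(r["api_key"], r["ts"]), r["ts"])
for key, ts in last_seen.items():
    if now - ts > timedelta(days=30):
        print(f"{key} unused for 30+ days - consider revoking it")
```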

Alerting

Configure alerts to catch problems before they impact users. Set up alerting rules through the admin portal with custom thresholds and notification channels (email, Slack, PagerDuty):
Trigger: Approaching or exceeding rate limits
Notification: Email or Slack when usage reaches 80%, 90%, 95%
Action: Adjust rate limits in the admin portal or optimize usage

Trigger: Error rate exceeds threshold (e.g., greater than 5%)
Notification: Alert to dev team via configured channels
Action: Investigate failing operations, check API health

Trigger: Response times exceed SLA (e.g., P95 > 2s)
Notification: Performance degradation warning
Action: Check backend API performance, review load

Trigger: Unusual authentication failures or access patterns
Notification: Immediate alert to security team
Action: Rotate keys, review access logs, investigate breach
Configure alerts through the dashboard’s Alerting section with custom thresholds, notification recipients (email addresses, Slack webhooks, PagerDuty integration keys), delivery channels, and sensitivity settings to reduce false positives. Changes take effect immediately.
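The portal handles evaluation and delivery for you, but if you also run external monitoring, the thresholds above are straightforward to mirror. A minimal sketch, assuming you already have a metrics snapshot from the dashboard or an export; notify() is a placeholder for your Slack or PagerDuty hook:

```python
# Sketch: evaluate the alert thresholds listed above against a metrics
# snapshot. The snapshot keys and notify() hook are placeholders.
def notify(message: str) -> None:
    print(f"ALERT: {message}")    # replace with Slack webhook / PagerDuty call

def evaluate(snapshot: dict) -> None:
    usage = snapshot["requests_today"] / snapshot["daily_limit"]
    for threshold in (0.95, 0.90, 0.80):        # highest crossed threshold wins
        if usage >= threshold:
            notify(f"Rate limit usage at {usage:.0%} (>= {threshold:.0%})")
            break
    if snapshot["error_rate"] > 0.05:
        notify(f"Error rate {snapshot['error_rate']:.1%} exceeds 5%")
    if snapshot["p95_ms"] > 2000:
        notify(f"P95 latency {snapshot['p95_ms']}ms exceeds the 2s SLA")
    if snapshot["auth_failure_rate"] > 0.005:
        notify(f"Auth failures at {snapshot['auth_failure_rate']:.1%} - page security")

evaluate({"requests_today": 46_000, "daily_limit": 50_000,
          "error_rate": 0.02, "p95_ms": 2300, "auth_failure_rate": 0.001})
```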

Troubleshooting with metrics

When problems occur, metrics provide the diagnostic data needed to identify root causes quickly.

High error rates

Symptom: Error rate increases from a baseline of less than 1% to 5 to 10% for specific tools.

Investigation steps: Check error messages in the platform dashboard (grouped by error code and tool). Identify which tool or tools are failing - is it one tool or multiple? Verify backend API health via health check endpoints or APM tools. Review recent changes to agent code (did a deployment introduce invalid parameter formats?).

Common causes: Backend API downtime or degradation (5xx errors), invalid parameters from agent logic changes (validation error -32602), authentication or authorization failures (expired JWT tokens, revoked API keys), rate limits exceeded (agents not implementing backoff).
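If you prefer working from an export rather than the dashboard’s grouping, the same breakdown takes a few lines of code. A minimal sketch, assuming failed requests exported with a tool name and error code; the fields are illustrative:

```python
# Sketch: group failed requests by tool and error code to see whether one
# tool or many are failing. Field names are assumptions.
from collections import Counter

errors = [
    {"tool": "searchTransactions", "code": -32602},
    {"tool": "searchTransactions", "code": -32602},
    {"tool": "getCustomerProfile", "code": 500},
    # ... remaining failed requests
]

by_tool_and_code = Counter((e["tool"], e["code"]) for e in errors)
for (tool, code), count in by_tool_and_code.most_common():
    print(f"{tool}  code={code}  count={count}")
```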

Slow response times

Symptom: P95 latency increases from a typical 200-300ms to 2-5+ seconds. Agents time out waiting for responses.

Investigation steps: Check P95/P99 latency trends over the last 24 hours - gradual increase or sudden spike? Identify slow-performing tools - is one tool dragging down the averages? Review request volume for correlation - did latency spike when traffic increased 3x? Check backend API performance via APM dashboards (database query times, external API calls).

Common causes: Slow backend database queries (missing indexes, full table scans), high load on backend systems (CPU or memory saturation during peak traffic), network issues (packet loss, DNS resolution delays), large response payloads (returning 10MB of JSON when 50KB would suffice).
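To attribute a regression to a specific tool, per-tool P95 from the same kind of export is usually enough. A minimal sketch under the same assumed fields as the earlier examples:

```python
# Sketch: compute per-tool P95 latency to find the tool dragging down averages.
from collections import defaultdict
from statistics import quantiles

requests = [
    {"tool": "getCustomerProfile", "latency_ms": 220},
    {"tool": "searchTransactions", "latency_ms": 4100},
    {"tool": "searchTransactions", "latency_ms": 3800},
    # ... remaining exported rows
]

by_tool = defaultdict(list)
for r in requests:
    by_tool[r["tool"]].append(r["latency_ms"])

for tool, samples in sorted(by_tool.items()):
    if len(samples) >= 2:                      # quantiles() needs at least 2 points
        p95 = quantiles(samples, n=100)[94]
        print(f"{tool}: P95={p95:.0f}ms over {len(samples)} requests")
```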

Unexpected usage spikes

Symptom: Request volume suddenly jumps 5 to 10 times above baseline (e.g., from 100 requests/minute to 800/minute).

Investigation steps: Check which tools are being called more - one tool or all tools? Identify which API keys and clients are responsible via dashboard filters. Review for new agent deployments around the spike time. Look for retry storm patterns (hundreds of requests within seconds to the same tool). Check for automated testing accidentally hitting production.

Common causes: A new agent deployment with higher user adoption than expected, agent retry logic gone wrong (no backoff, infinite retries), automated testing misconfigured to hit production instead of staging, a security incident (a compromised API key being abused for data scraping).
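Retry storms are easy to flag programmatically. A minimal sketch that reports five or more calls to the same tool within a two-second window, assuming exported rows with a tool name and a Unix timestamp; the fields and thresholds are illustrative:

```python
# Sketch: detect retry-storm patterns (many calls to one tool in a short window).
from collections import defaultdict

WINDOW_S = 2.0      # window size in seconds
THRESHOLD = 5       # calls within the window that count as a storm

requests = [
    {"tool": "searchTransactions", "ts": 1700000000.0},
    {"tool": "searchTransactions", "ts": 1700000000.3},
    # ... remaining exported rows (ts = Unix seconds)
]

by_tool = defaultdict(list)
for r in requests:
    by_tool[r["tool"]].append(r["ts"])

for tool, stamps in by_tool.items():
    stamps.sort()
    for i in range(len(stamps) - THRESHOLD + 1):
        if stamps[i + THRESHOLD - 1] - stamps[i] <= WINDOW_S:
            print(f"Possible retry storm: {THRESHOLD} calls to {tool} "
                  f"within {WINDOW_S}s starting at {stamps[i]}")
            break
```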

Best practices

Review metrics weekly to spot trends early. Gradual increases in error rates or latency often precede major outages - catch them before users complain.

Establish baselines for normal usage patterns. You can’t detect anomalies without knowing what “normal” looks like. Document typical request volumes (100/min during business hours, 10/min at night), error rates (less than 1%), and latency (P95: 250ms).

Document seasonal variations. If you know request volume doubles during tax season (January-April) or Black Friday weekend, that’s not an anomaly - it’s expected. Note these patterns so you don’t panic when they recur.

Configure alerts proactively before problems occur. Don’t wait for a production outage to set up monitoring - configure rate limit alerts, error rate alerts, and latency alerts during initial deployment through the admin portal.
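Baselines are easiest to act on when they are machine-checkable. A minimal sketch that compares current numbers against a documented baseline; the baseline figures are the examples above, and the 2x tolerance is an arbitrary illustration:

```python
# Sketch: flag metrics that exceed the documented baseline by more than a
# tolerance factor. Baseline values are the examples from the text.
BASELINE = {
    "requests_per_min_business_hours": 100,
    "error_rate": 0.01,
    "p95_ms": 250,
}

def exceeds_baseline(current: dict, tolerance: float = 2.0) -> list[str]:
    """Return the metrics that are more than `tolerance` times their baseline."""
    return [
        name for name, baseline in BASELINE.items()
        if current.get(name, 0) > baseline * tolerance
    ]

print(exceeds_baseline(
    {"requests_per_min_business_hours": 450, "error_rate": 0.008, "p95_ms": 240}
))  # ['requests_per_min_business_hours']
```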

Next steps