Available metrics
Grand Central tracks four categories of metrics across all agents using your subscription.

Request metrics show overall platform usage. Track total requests per day/month to understand growth trends. Monitor response time distributions (P50/P95/P99) to detect performance degradation. Watch success vs error rates to spot backend problems. Identify peak usage periods to plan capacity or adjust rate limits through the admin portal during high-traffic windows.

Tool usage metrics reveal which operations agents invoke most frequently. The top 10 tools list helps prioritize optimization: if getCustomerProfile accounts for 40% of requests, caching that data reduces request volume. Tool-specific error rates identify problematic integrations (e.g., if searchTransactions fails 15% of the time, the backend API likely has issues). Usage patterns by client type show whether Claude agents behave differently than Copilot agents.
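For example, a few lines of Python can turn exported request logs into the top-tools and error-rate views described above. The record fields here are illustrative assumptions, not Grand Central's actual export schema:

```python
from collections import Counter

# Illustrative request records; the field names are assumptions,
# not Grand Central's actual export schema.
requests = [
    {"tool": "getCustomerProfile", "status": "ok"},
    {"tool": "searchTransactions", "status": "error"},
    {"tool": "getCustomerProfile", "status": "ok"},
]

calls = Counter(r["tool"] for r in requests)
errors = Counter(r["tool"] for r in requests if r["status"] == "error")

# Top tools by invocation count, with per-tool error rates.
for tool, count in calls.most_common(10):
    print(f"{tool}: {count} calls, {errors[tool] / count:.1%} errors")
```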
Rate limit metrics prevent surprises. Current usage vs rate limit percentage shows how close you are to limits (above 90%: adjust limits in the portal soon). Rate limit hit frequency indicates whether agents need better caching (above a 5% hit rate: implement conversation context caching). Subscription utilization trends help predict when you'll need higher limits.
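These thresholds reduce to simple arithmetic. A minimal sketch, where the function and the sample numbers are hypothetical and only the 90% and 5% cutoffs come from this section:

```python
def rate_limit_health(used: int, limit: int, hits_429: int, total: int) -> str:
    """Apply the utilization and hit-rate thresholds from this section."""
    if used / limit > 0.90:
        return "approaching limit: adjust limits in the admin portal soon"
    if hits_429 / total > 0.05:
        return "frequent 429s: implement conversation context caching"
    return "limits look appropriately sized"

# Hypothetical daily numbers: 9,200 of 10,000 requests used, 120 rate-limit hits.
print(rate_limit_health(used=9_200, limit=10_000, hits_429=120, total=9_200))
```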
Authentication and security metrics detect credential problems and potential attacks. Failed authentication attempts spike when API keys expire or get revoked. API key usage patterns reveal which keys are active vs unused (rotate or revoke the unused ones). Security anomalies like midnight traffic spikes or requests from unexpected geographic regions warrant investigation. Credential rotation tracking reminds you when keys approach 90-day rotation deadlines.
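A rotation reminder can be as simple as comparing key creation dates against the 90-day deadline. A minimal sketch, assuming you keep a key inventory with creation dates (the key names and dates are made up):

```python
from datetime import date, timedelta

ROTATION_PERIOD = timedelta(days=90)

# Hypothetical key inventory; pull real creation dates from your records.
api_keys = {
    "support-agent-prod": date(2025, 1, 10),
    "support-agent-dev": date(2025, 3, 1),
}

today = date.today()
for name, created in api_keys.items():
    days_left = (created + ROTATION_PERIOD - today).days
    if days_left <= 14:
        print(f"{name}: rotation deadline in {days_left} days")
```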
Accessing your metrics
Grand Central provides a web-based monitoring dashboard with real-time and historical views. The dashboard shows current request rates (requests/minute live graph), latency distributions (P50/P95/P99 over the last hour/day/week), top tools by invocation count, and error breakdowns by tool and cause. Filters let you narrow metrics by API key (useful if multiple agents share a subscription), date range, or specific tools.

Dashboard access is configured through the admin portal. Users with portal access can view metrics for their subscriptions automatically. Grant additional team members access through the portal's user management section by adding their email addresses or Entra groups. Typically, dashboard access goes to developers building the agents, DevOps teams managing deployments, and support engineers troubleshooting issues.

API usage reports provide deeper analysis than real-time dashboards. The system generates monthly reports automatically that include usage summaries (total requests, tool breakdown, success rates), cost analysis (if your subscription has usage-based billing), trend analysis (month-over-month growth, seasonal patterns), and optimization recommendations based on observed usage patterns. Download reports from the dashboard's Reports section. These reports help with budget planning and identifying optimization opportunities.

Common monitoring patterns
Different teams monitor for different signals. Here's what production teams watch:

Agent performance monitoring
Track which tools agents invoke most frequently to identify caching opportunities. If getCustomerProfile appears in the top 3 tools (by invocation count) and agents serve 100+ conversations per day, they're probably calling it redundantly. Implement conversation context caching to reduce load.
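A minimal sketch of conversation context caching, assuming a hypothetical fetch_customer_profile wrapper around the getCustomerProfile tool:

```python
# Conversation-scoped cache: fetch the profile once per conversation,
# reuse it on later turns, and evict when the conversation ends.
profile_cache: dict[str, dict] = {}

def fetch_customer_profile(customer_id: str) -> dict:
    """Stand-in for invoking the getCustomerProfile tool."""
    return {"id": customer_id}

def get_profile(conversation_id: str, customer_id: str) -> dict:
    if conversation_id not in profile_cache:
        profile_cache[conversation_id] = fetch_customer_profile(customer_id)
    return profile_cache[conversation_id]

def end_conversation(conversation_id: str) -> None:
    profile_cache.pop(conversation_id, None)  # avoid stale data across sessions
```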
Monitor tool-specific error rates to find integration problems early. If searchTransactions has a 10% error rate while other tools sit at less than 1%, the backend search API likely has issues. Escalate to the team that owns that API before users complain about broken functionality.
Check whether agents retry appropriately. Look for patterns like "5 consecutive calls to the same tool within 2 seconds", which indicate retry loops without backoff. Fix agent retry logic to use exponential delays (1s, 2s, 4s).
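A backoff wrapper along these lines keeps retries from hammering the API; invoke stands in for whatever call the agent is retrying:

```python
import time

def call_with_backoff(invoke, max_attempts: int = 4):
    """Retry with exponential delays (1s, 2s, 4s) instead of tight loops."""
    for attempt in range(max_attempts):
        try:
            return invoke()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts
```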
Validate latency against SLAs. If your UX requires under 3 seconds total response time, and tool invocations average 2 seconds, you’re cutting it close. Identify slow tools and either optimize the backend or implement caching.
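A quick percentile check makes the headroom calculation concrete. The latency samples and the 2-second tool budget below are assumptions for illustration:

```python
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; good enough for eyeballing distributions."""
    ordered = sorted(samples)
    return ordered[max(0, round(pct / 100 * len(ordered)) - 1)]

tool_latencies_s = [1.7, 1.8, 1.9, 2.0, 2.1, 2.6, 3.1]  # illustrative samples
TOOL_BUDGET_S = 2.0  # assumed share of a 3s end-to-end SLA left for tool calls

p95 = percentile(tool_latencies_s, 95)
if p95 > TOOL_BUDGET_S:
    print(f"P95 tool latency {p95:.1f}s exceeds the {TOOL_BUDGET_S}s tool budget")
```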
Rate limit management
Watch daily/monthly request counts relative to your configured rate limits. If you're consistently hitting 85 to 95% of limits, adjust them in the admin portal before hitting hard stops. Plan for growth: if usage increases 20% month-over-month, you'll need higher limits soon.

Monitor rate limit hit frequency (how often agents get 429 errors). If greater than 5% of requests hit rate limits, agents are hammering the API inefficiently. Implement caching, batch operations, or increase rate limits through the portal. If less than 1%, your limits are appropriately sized.

Identify peak usage times. If agents serve customer support during business hours (9am to 5pm), expect peaks around 10am and 2pm when call volumes surge. Spread batch processing jobs (report generation, data exports) to off-peak windows (nights, weekends) to avoid competing for capacity during peak hours.

Alert on unexpected spikes. A 10x request surge at 2am suggests a problem: runaway retry loop, compromised API key being abused, or a misconfigured integration test hammering production. Investigate immediately.
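A spike check can be a one-liner against a rolling baseline. The baseline and counts below are invented for illustration:

```python
def is_spike(current: int, baseline: float, factor: float = 10.0) -> bool:
    """Flag when the current hourly count exceeds the baseline by 10x."""
    return baseline > 0 and current >= factor * baseline

# Hypothetical hourly request counts for the 2am window.
if is_spike(current=412, baseline=40.0):
    print("10x surge at 2am: check for retry loops or abused API keys")
```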
Security monitoring

Failed authentication attempts should be rare (less than 0.5% of requests). A spike to 5 to 10% indicates API key expiration, rotation issues, or a potential credential-stuffing attack. Rotate keys immediately if suspicious, and review access logs with the security team.

API key usage patterns reveal whether keys are appropriately scoped. If a development API key shows production-level traffic, developers are using the wrong credentials: enforce environment separation. If a key sits unused for 30 or more days, revoke it to reduce attack surface.

Access patterns inconsistent with normal usage warrant investigation. Examples: requests from unexpected geographic regions (your agents run in US-East, but metrics show traffic from Asia), unusual time-of-day patterns (a customer support agent active at 3am), or suspicious tool combinations (why is a read-only support agent calling admin deletion tools?). A pre-screening sketch follows after the response steps below.

When security alerts trigger:
- Rotate compromised API keys immediately through the admin portal (generate new key, deploy to agents, revoke old key).
- Review access logs in the monitoring dashboard to understand the scope of a potential breach.
- Implement stricter access controls through the portal (add JWT authentication requirements, adjust tool access).
- Enable additional monitoring through the alerting configuration.
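The anomaly patterns above are easy to pre-screen in code before a human looks. A rough sketch; the expected region and business hours are assumptions about your deployment:

```python
EXPECTED_REGIONS = {"us-east"}  # assumption: where your agents actually run
BUSINESS_HOURS = range(9, 17)   # assumption: 9am-5pm support window

def flag_anomalies(events):
    """Yield (reason, event) pairs for access inconsistent with normal usage."""
    for event in events:  # event: {"region": str, "hour": int, "tool": str}
        if event["region"] not in EXPECTED_REGIONS:
            yield "unexpected region", event
        elif event["hour"] not in BUSINESS_HOURS:
            yield "off-hours activity", event

sample = [{"region": "ap-south", "hour": 3, "tool": "searchTransactions"}]
for reason, event in flag_anomalies(sample):
    print(reason, event)
```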
Alerting

Configure alerts to catch problems before they impact users. Set up alerting rules through the admin portal with custom thresholds and notification channels (email, Slack, PagerDuty):
Rate limit alerts
Trigger: Approaching or exceeding rate limits
Notification: Email or Slack when usage reaches 80%, 90%, 95%
Action: Adjust rate limits in admin portal or optimize usage
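This guide doesn't show the portal's configuration format, so the rule below is only a sketch of the thresholds and channels described above, expressed as a Python dict:

```python
# Hypothetical representation of a rate-limit alerting rule; the admin
# portal's real configuration format may differ.
rate_limit_alert = {
    "metric": "rate_limit_utilization",
    "thresholds": [0.80, 0.90, 0.95],  # notify at 80%, 90%, and 95% usage
    "channels": ["email", "slack"],
    "action": "adjust rate limits in the admin portal or optimize usage",
}
```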
Error rate alerts
Trigger: Error rate exceeds threshold (e.g., greater than 5%)
Notification: Alert to dev team via configured channels
Action: Investigate failing operations, check API health
Latency alerts
Trigger: Response times exceed SLA (e.g., P95 > 2s)
Notification: Performance degradation warning
Action: Check backend API performance, review load
Security alerts
Trigger: Unusual authentication failures or access patterns
Notification: Immediate alert to security team
Action: Rotate keys, review access logs, investigate breach