Overview
The platform provides comprehensive monitoring capabilities through:
- DataDog - Application performance monitoring, logs, and metrics
- Grafana - Dashboards, metrics visualization, and alerting
- Kubernetes - Container and cluster-level monitoring
- Azure APIM - API analytics and usage metrics
Accessing monitoring tools
Access monitoring tools to view logs, metrics, and traces for your connectors.
DataDog access
Prerequisites
To access DataDog, you need:
- Azure access (user must be in Azure AD)
- DataDog roles assigned in self-service.tfvars:
  - {env}-ro: Read-only access to logs and metrics
  - {env}-rw: Full access including creating dashboards, synthetics, and monitors
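As an illustration only, a role assignment in self-service.tfvars might look roughly like the following; the actual variable names and structure are defined by the platform's self-service repository, and the users, environments, and roles shown are placeholders.

```hcl
# Hypothetical sketch — actual variable names and structure come from the
# platform's self-service repository; users and roles below are placeholders.
datadog_role_assignments = {
  "jane.doe@example.com"   = ["dev-ro", "stg-ro"] # read-only in dev and stg
  "john.smith@example.com" = ["dev-rw"]           # full access in dev
}
```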
Accessing DataDog
- Navigate to: gc-ecos.datadoghq.eu
- First login: may fail with an HTTP 500 error while just-in-time provisioning completes
- Second login: should succeed
- Select your environment (dev, stg, test, uat)
Grafana access
Access Grafana for dashboards and metrics visualization.
Prerequisites
To access Grafana, you need:
- Microsoft Entra ID (Azure AD) account
- Grafana roles assigned in self-service.tfvars:
  - {env}-ro: Read-only access to logs and dashboards
  - {env}-rw: Full access including creating dashboards, monitors, and alerts
Accessing Grafana
- Navigate to: ecosecos.grafana.net
- Sign in with your Microsoft Entra ID credentials
- Select your workspace/environment
Monitoring connector health
Monitor your connector’s health using logs, metrics, and traces.
Application logs
Viewing logs in DataDog
- Navigate to Logs in DataDog
- Filter by service: Search for your connector name
- Filter by environment: Select dev, stg, test, or uat
- Time range: Select appropriate time window
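For example, a log search query along these lines (the service name is a placeholder, and the env facet assumes your logs are tagged by environment) narrows results to a single connector's errors:

```
service:my-connector env:dev status:error
```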
Viewing logs in Grafana
- Navigate to Explore in Grafana
- Select data source: Loki or appropriate log source
- Query logs: Use LogQL or similar query language
- Filter by labels: service, environment, level
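If Loki is the data source, a LogQL query might look like this; the label names are assumptions and depend on how logs are shipped:

```logql
# Select the connector's dev logs, then keep only lines containing "ERROR"
{service="my-connector", environment="dev"} |= "ERROR"
```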
Log levels
Monitor different log levels:
- ERROR: Critical issues requiring immediate attention
- WARN: Potential issues or unusual conditions
- INFO: General operational information
- DEBUG: Detailed diagnostic information
Application metrics
Track key performance and resource metrics for your connector.
Key metrics to monitor
Performance Metrics:
- Request rate: Requests per second
- Response time: P50, P95, P99 latencies
- Error rate: Percentage of failed requests
- Throughput: Messages processed per second
Resource Metrics:
- CPU utilization: Current CPU usage
- Memory usage: Memory consumption
- Network I/O: Network traffic
- Disk I/O: Disk read/write operations
Connector Metrics:
- Message processing rate: Messages processed successfully
- API call success rate: External API call success percentage
- Queue depth: Number of pending messages
- Processing time: Time to process each message
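If the connector exposes Prometheus-format metrics (see the /metrics endpoint under Health checks), queries along these lines can derive several of the metrics above. The metric names are conventional examples, not guaranteed to match your instrumentation:

```promql
# Request rate: requests per second over the last 5 minutes
sum(rate(http_requests_total{service="my-connector"}[5m]))

# Error rate: percentage of requests that returned a 5xx status
100 * sum(rate(http_requests_total{service="my-connector", status=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="my-connector"}[5m]))

# P95 response time from a latency histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{service="my-connector"}[5m])) by (le))
```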
Viewing metrics in DataDog
- Navigate to Metrics or Dashboards
- Search for metrics: Use metric names or tags
- Create custom dashboards: Visualize key metrics
- Set up monitors: Alert on threshold breaches
Viewing metrics in Grafana
- Navigate to Dashboards
- Select connector dashboard (if available)
- Create custom panels: Add new visualizations
- Configure alerts: Set up alerting rules
Distributed tracing
Trace requests as they flow through your connector and external services.
Request tracing
Trace requests across services:
- View request flow: See how requests traverse through services
- Identify bottlenecks: Find slow operations
- Error tracking: See where errors occur in the flow
Using DataDog APM
- Navigate to APM in DataDog
- Select your service: Find your connector
- View traces: See individual request traces
- Analyze performance: Identify slow endpoints
Health checks
Verify your connector is running and ready to handle requests.
Kubernetes health checks
Monitor connector health at the Kubernetes level using liveness and readiness probes (a configuration sketch follows below).
Application health endpoints
Monitor application-level health:
- Liveness endpoint: /health/live
- Readiness endpoint: /health/ready
- Metrics endpoint: /metrics (Prometheus format)
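As a sketch, the deployment's liveness and readiness probes can point at these endpoints; the container port and timings below are assumptions and should be adjusted to your deployment.

```yaml
# Sketch only — placed under the container spec in the Deployment;
# port and timings are assumptions.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```

Probe results and restart counts are visible via kubectl describe pod <pod-name>.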
Error tracking
Track and analyze errors to identify and resolve issues.
Error monitoring
In DataDog:
- Navigate to Error Tracking
- Filter by service: Your connector name
- View error trends: See error frequency over time
- Analyze stack traces: Understand error causes
Common error types:
- Connection errors: External API connectivity issues
- Timeout errors: Slow or unresponsive services
- Authentication errors: Invalid credentials or tokens
- Validation errors: Invalid data formats
- Resource errors: Memory or CPU exhaustion
Alerting
Configure alerts to notify you when issues occur.
Setting up alerts in DataDog
- Navigate to Monitors
- Create New Monitor
- Select metric: Choose metric to monitor
- Set thresholds: Define warning and critical thresholds
- Configure notifications: Set up notification channels
- Save monitor
Example alert conditions:
- Error rate > 5% for 5 minutes
- Response time P95 > 1000ms for 10 minutes
- CPU usage > 80% for 5 minutes
- Memory usage > 90% for 5 minutes
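As a rough sketch, the CPU condition above could be written in DataDog's monitor query form as follows; the metric and tag are assumptions (containerized connectors will typically use a different CPU metric):

```
avg(last_5m):avg:system.cpu.user{service:my-connector} > 80
```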
Setting up alerts in Grafana
- Navigate to Alerting
- Create Alert Rule
- Define query: Select metric and conditions
- Set evaluation interval: How often to check
- Configure notifications: Set up notification channels
- Save alert rule
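For a Prometheus-backed data source, the error-rate condition from the DataDog example could be expressed with a PromQL query such as the following (metric names are the same assumptions as earlier):

```promql
# Fires when more than 5% of requests failed over the last 5 minutes
100 * sum(rate(http_requests_total{service="my-connector", status=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="my-connector"}[5m])) > 5
```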
Dashboards
Create dashboards to visualize key metrics and monitor connector health.
Creating custom dashboards
In DataDog:
- Navigate to Dashboards
- Create New Dashboard
- Add widgets: Graphs, logs, metrics
- Configure queries: Set up metric queries
- Save dashboard
Recommended widgets:
- Request rate over time
- Error rate percentage
- Response time percentiles
- CPU and memory usage
- Active connections
- Queue depth
- Recent errors log
In Grafana:
- Navigate to Dashboards
- Create Dashboard
- Add panels: Time series, logs, stat panels
- Configure queries: Set up PromQL or similar
- Save dashboard
Performance analysis
Analyze performance data to identify and resolve bottlenecks.
Identifying performance issues
Slow Response Times:
- Check external API latency: External services may be slow
- Review database queries: Optimize slow queries
- Analyze thread pool usage: May need more threads
- Check resource constraints: CPU or memory limits
High Error Rates:
- Review error logs: Identify error patterns
- Check external service health: External APIs may be down
- Validate input data: Ensure data format is correct
- Review authentication: Check token expiration
High Resource Usage:
- Monitor memory usage: Check for memory leaks
- Review CPU usage: Optimize CPU-intensive operations
- Check connection pools: May need to increase pool size
- Review thread usage: Optimize thread pool configuration
Log archives
Access archived logs for long-term analysis and compliance.
Accessing log archives
For long-term log storage and analysis:
Prerequisites:
- Azure role: Log archive access role assigned
- Storage account access: Access to Azure Storage Account
- Key Vault access: For SPN credentials (automation)
Access methods (example commands are shown after this list):
- Azure Portal: Browse storage account
- Azure Storage Explorer: Desktop application
- Azure CLI: Command-line access
- AzCopy: High-performance copying
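As a sketch with the Azure CLI and AzCopy (storage account, container, and path names are placeholders), archived logs can be listed and downloaded like this:

```sh
# List archived log blobs (names below are placeholders)
az storage blob list \
  --account-name <storage-account> \
  --container-name <log-archive-container> \
  --prefix "connectors/my-connector/2024/" \
  --auth-mode login \
  --output table

# Download a range of archives (requires `azcopy login` or a SAS token)
azcopy copy \
  "https://<storage-account>.blob.core.windows.net/<log-archive-container>/connectors/my-connector/2024/" \
  "./archived-logs/" --recursive
```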
Use cases:
- Compliance: Long-term log retention
- Forensics: Investigating historical issues
- Analytics: Analyzing trends over time
- Auditing: Security and compliance audits
API management monitoring
Monitor API usage and performance through Azure API Management.
APIM analytics
Monitor API usage through Azure APIM.
Metrics Available:
- Request count: Total API requests
- Response time: API response times
- Error rate: Failed API calls
- Bandwidth: Data transfer volume
- Subscription usage: Per-subscription metrics
To view analytics:
- Navigate to Azure Portal
- Select APIM instance for your environment
- View Analytics: See usage and performance metrics
- Export reports: Generate usage reports
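Request counts can also be pulled with the Azure CLI; the resource ID is a placeholder and the metric name is an assumption based on standard APIM metrics:

```sh
# Total gateway requests per hour for the environment's APIM instance
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ApiManagement/service/<apim-name>" \
  --metric "Requests" \
  --interval PT1H \
  --output table
```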
Best practices
- Set up alerts early: Configure alerts during development
- Monitor key metrics: Focus on business-critical metrics
- Create dashboards: Visualize important metrics
- Review logs regularly: Check logs for issues
- Track error trends: Monitor error rates over time
- Set up SLOs: Define service level objectives
- Document runbooks: Create troubleshooting guides
- Regular reviews: Review metrics and alerts periodically
Troubleshooting guide
Use these steps to diagnose and resolve common issues.
High error rate
- Check logs: Review error messages in DataDog/Grafana
- Identify pattern: Look for common error types
- Check external services: Verify external API health
- Review recent changes: Check recent deployments
- Scale resources: May need more resources
Slow performance
- Check response times: Identify slow operations
- Review resource usage: Check CPU and memory
- Analyze traces: Find bottlenecks in request flow
- Check external dependencies: External services may be slow
- Optimize code: Review and optimize slow operations
Memory issues
- Check memory usage: Monitor memory consumption
- Review heap dumps: Analyze memory usage patterns
- Check for leaks: Identify memory leaks
- Increase limits: Adjust memory limits if needed
- Optimize code: Reduce memory footprint
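A quick way to confirm memory pressure at the Kubernetes level (pod and namespace names are placeholders):

```sh
# Current CPU/memory usage per pod (requires metrics-server)
kubectl top pod -n <namespace>

# Inspect limits, restart counts, and OOMKilled events for a pod
kubectl describe pod <pod-name> -n <namespace>
```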
Connection issues
- Check network connectivity: Verify network connections
- Review connection pool: Check pool configuration
- Monitor connection errors: Track connection failures
- Check firewall rules: Verify firewall configurations
- Review DNS: Check DNS resolution
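A quick connectivity and DNS check from inside the connector's pod (assumes the container image includes basic network tools; pod, namespace, and hostnames are placeholders):

```sh
# Resolve the external API's hostname from inside the pod
kubectl exec -it <pod-name> -n <namespace> -- nslookup api.example.com

# Check HTTPS reachability of the external endpoint
kubectl exec -it <pod-name> -n <namespace> -- curl -sv https://api.example.com/health
```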
Getting help
- DataDog Support: Use DataDog’s help documentation
- Grafana Support: Check Grafana documentation
- Team Support: Contact your team lead for access issues
- Platform Support: Create support ticket for platform issues