This page covers how to monitor, observe, and troubleshoot your custom connectors in production using the available monitoring and observability tools.

Overview

The platform provides comprehensive monitoring capabilities through:
  • DataDog - Application performance monitoring, logs, and metrics
  • Grafana - Dashboards, metrics visualization, and alerting
  • Kubernetes - Container and cluster-level monitoring
  • Azure APIM - API analytics and usage metrics

Accessing monitoring tools

Access monitoring tools to view logs, metrics, and traces for your connectors.

DataDog access

Prerequisites

To access DataDog, you need:
  • Azure access (user must be in Azure AD)
  • DataDog roles assigned in self-service.tfvars:
team_members = {
  "[email protected]" = {
    datadog = {
      roles = ["dev-rw", "stg-ro"]
    }
  }
}
Available DataDog Roles:
  • {env}-ro: Read-only access to logs and metrics
  • {env}-rw: Full access including creating dashboards, synthetics, and monitors

Accessing DataDog

  1. Navigate to: gc-ecos.datadoghq.eu
  2. First login: may fail with an HTTP 500 error while just-in-time provisioning completes
  3. Second login: should succeed
  4. Select your environment (dev, stg, test, uat)

Grafana access

Access Grafana for dashboards and metrics visualization.

Prerequisites

To access Grafana, you need:
  • Microsoft Entra ID (Azure AD) account
  • Grafana roles assigned in self-service.tfvars:
team_members = {
  "[email protected]" = {
    grafana = {
      roles = ["dev-rw", "stg-ro"]
    }
  }
}
Available Grafana Roles:
  • {env}-ro: Read-only access to logs and dashboards
  • {env}-rw: Full access including creating dashboards, monitors, and alerts

Accessing Grafana

  1. Navigate to: ecosecos.grafana.net
  2. Sign in with your Microsoft Entra ID credentials
  3. Select your workspace/environment

Monitoring connector health

Monitor your connector’s health using logs, metrics, and traces.

Application logs

Viewing logs in DataDog

  1. Navigate to Logs in DataDog
  2. Filter by service: Search for your connector name
  3. Filter by environment: Select dev, stg, test, or uat
  4. Time range: Select appropriate time window
Log Search Examples:
service:my-custom-connector env:dev
service:my-custom-connector status:error
service:my-custom-connector "exception"

Viewing logs in Grafana

  1. Navigate to Explore in Grafana
  2. Select data source: Loki or appropriate log source
  3. Query logs: Use LogQL or similar query language
  4. Filter by labels: service, environment, level
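Example LogQL Queries (a sketch; the service, env, and level labels are assumptions and depend on how your connector's logs are shipped to Loki):
{service="my-custom-connector", env="dev"}
{service="my-custom-connector"} |= "exception"
{service="my-custom-connector", level="error"}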

Log levels

Monitor different log levels:
  • ERROR: Critical issues requiring immediate attention
  • WARN: Potential issues or unusual conditions
  • INFO: General operational information
  • DEBUG: Detailed diagnostic information

Application metrics

Track key performance and resource metrics for your connector.

Key metrics to monitor

Performance Metrics:
  • Request rate: Requests per second
  • Response time: P50, P95, P99 latencies
  • Error rate: Percentage of failed requests
  • Throughput: Messages processed per second
Resource Metrics:
  • CPU utilization: Current CPU usage
  • Memory usage: Memory consumption
  • Network I/O: Network traffic
  • Disk I/O: Disk read/write operations
Business Metrics:
  • Message processing rate: Messages processed successfully
  • API call success rate: External API call success percentage
  • Queue depth: Number of pending messages
  • Processing time: Time to process each message

Viewing metrics in DataDog

  1. Navigate to Metrics or Dashboards
  2. Search for metrics: Use metric names or tags
  3. Create custom dashboards: Visualize key metrics
  4. Set up monitors: Alert on threshold breaches
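Example Metric Queries (a sketch; the metric and tag names are assumptions and depend on which integrations and instrumentation are enabled for your connector):
avg:kubernetes.cpu.usage.total{kube_deployment:my-custom-connector} by {pod_name}
avg:kubernetes.memory.usage{kube_deployment:my-custom-connector} by {pod_name}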

Viewing metrics in Grafana

  1. Navigate to Dashboards
  2. Select connector dashboard (if available)
  3. Create custom panels: Add new visualizations
  4. Configure alerts: Set up alerting rules
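Example PromQL Queries for panels (a sketch; http_requests_total and http_request_duration_seconds_bucket are placeholder metric names, substitute whatever your connector actually exposes on its /metrics endpoint):
sum(rate(http_requests_total{service="my-custom-connector"}[5m]))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="my-custom-connector"}[5m])) by (le))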

Distributed tracing

Trace requests as they flow through your connector and external services.

Request tracing

Trace requests across services:
  • View request flow: See how requests traverse through services
  • Identify bottlenecks: Find slow operations
  • Error tracking: See where errors occur in the flow

Using DataDog APM

  1. Navigate to APM in DataDog
  2. Select your service: Find your connector
  3. View traces: See individual request traces
  4. Analyze performance: Identify slow endpoints

Health checks

Verify your connector is running and ready to handle requests.

Kubernetes health checks

Monitor connector health at the Kubernetes level:
# Check pod status
kubectl get pods -n <namespace> -l app=my-custom-connector

# View pod events
kubectl describe pod <pod-name> -n <namespace>

# Check readiness
kubectl get endpoints -n <namespace>

Application health endpoints

Monitor application-level health:
  • Liveness endpoint: /health/live
  • Readiness endpoint: /health/ready
  • Metrics endpoint: /metrics (Prometheus format)
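A quick way to probe these endpoints manually (a sketch; the deployment name and port 8080 are assumptions, adjust them to your connector):
# Forward the connector's HTTP port to your workstation
kubectl port-forward -n <namespace> deploy/my-custom-connector 8080:8080

# Probe liveness, readiness, and metrics locally
curl -s http://localhost:8080/health/live
curl -s http://localhost:8080/health/ready
curl -s http://localhost:8080/metrics | head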

Error tracking

Track and analyze errors to identify and resolve issues.

Error monitoring

In DataDog:
  1. Navigate to Error Tracking
  2. Filter by service: Your connector name
  3. View error trends: See error frequency over time
  4. Analyze stack traces: Understand error causes
Common Error Types:
  • Connection errors: External API connectivity issues
  • Timeout errors: Slow or unresponsive services
  • Authentication errors: Invalid credentials or tokens
  • Validation errors: Invalid data formats
  • Resource errors: Memory or CPU exhaustion

Alerting

Configure alerts to notify you when issues occur.

Setting up alerts in DataDog

  1. Navigate to Monitors
  2. Create New Monitor
  3. Select metric: Choose metric to monitor
  4. Set thresholds: Define warning and critical thresholds
  5. Configure notifications: Set up notification channels
  6. Save monitor
Example Alert Conditions:
  • Error rate > 5% for 5 minutes
  • Response time P95 > 1000ms for 10 minutes
  • CPU usage > 80% for 5 minutes
  • Memory usage > 90% for 5 minutes
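Example Monitor Query (a sketch of a log-based error-count monitor in the query form used by the DataDog API and Terraform; metric-based error-rate monitors are also possible but depend on which APM metrics your connector emits):
logs("service:my-custom-connector status:error").index("*").rollup("count").last("5m") > 100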

Setting up alerts in Grafana

  1. Navigate to Alerting
  2. Create Alert Rule
  3. Define query: Select metric and conditions
  4. Set evaluation interval: How often to check
  5. Configure notifications: Set up notification channels
  6. Save alert rule
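Example Alert Query (a sketch; the http_requests_total metric and its status label are placeholders and should match your connector's actual Prometheus metrics):
sum(rate(http_requests_total{service="my-custom-connector", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{service="my-custom-connector"}[5m])) > 0.05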

Dashboards

Create dashboards to visualize key metrics and monitor connector health.

Creating custom dashboards

In DataDog:
  1. Navigate to Dashboards
  2. Create New Dashboard
  3. Add widgets: Graphs, logs, metrics
  4. Configure queries: Set up metric queries
  5. Save dashboard
Recommended Dashboard Panels:
  • Request rate over time
  • Error rate percentage
  • Response time percentiles
  • CPU and memory usage
  • Active connections
  • Queue depth
  • Recent errors log
In Grafana:
  1. Navigate to Dashboards
  2. Create Dashboard
  3. Add panels: Time series, logs, stat panels
  4. Configure queries: Set up PromQL or similar
  5. Save dashboard

Performance analysis

Analyze performance data to identify and resolve bottlenecks.

Identifying performance issues

Slow Response Times:
  1. Check external API latency: External services may be slow
  2. Review database queries: Optimize slow queries
  3. Analyze thread pool usage: May need more threads
  4. Check resource constraints: CPU or memory limits
High Error Rates:
  1. Review error logs: Identify error patterns
  2. Check external service health: External APIs may be down
  3. Validate input data: Ensure data format is correct
  4. Review authentication: Check token expiration
Resource Exhaustion:
  1. Monitor memory usage: Check for memory leaks
  2. Review CPU usage: Optimize CPU-intensive operations
  3. Check connection pools: May need to increase pool size
  4. Review thread usage: Optimize thread pool configuration
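To check resource constraints from the cluster side (a sketch; the namespace, label selector, and pod name are placeholders):
# Current CPU and memory usage per pod (requires metrics-server)
kubectl top pods -n <namespace> -l app=my-custom-connector

# Configured resource requests and limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'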

Log archives

Access archived logs for long-term analysis and compliance.

Accessing log archives

For long-term log storage and analysis:
Prerequisites:
  • Azure role: Log archive access role assigned
  • Storage account access: Access to Azure Storage Account
  • Key Vault access: For SPN credentials (automation)
Access Methods:
  • Azure Portal: Browse storage account
  • Azure Storage Explorer: Desktop application
  • Azure CLI: Command-line access
  • AzCopy: High-performance copying
Use Cases:
  • Compliance: Long-term log retention
  • Forensics: Investigating historical issues
  • Analytics: Analyzing trends over time
  • Auditing: Security and compliance audits
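Example commands for browsing and downloading archives (a sketch; the storage account, container, and path layout are hypothetical and depend on how the archive is configured):
# List archived log blobs for the connector
az storage blob list \
  --account-name <archive-storage-account> \
  --container-name <log-archive-container> \
  --prefix "my-custom-connector/" \
  --auth-mode login \
  --output table

# Download a set of archives with AzCopy
azcopy copy "https://<archive-storage-account>.blob.core.windows.net/<log-archive-container>/my-custom-connector/*" ./archived-logs/ --recursive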

API management monitoring

Monitor API usage and performance through Azure API Management.

APIM analytics

Metrics Available:
  • Request count: Total API requests
  • Response time: API response times
  • Error rate: Failed API calls
  • Bandwidth: Data transfer volume
  • Subscription usage: Per-subscription metrics
Access APIM Analytics:
  1. Navigate to Azure Portal
  2. Select APIM instance for your environment
  3. View Analytics: See usage and performance metrics
  4. Export reports: Generate usage reports
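The same data can be pulled from the command line via Azure Monitor (a sketch; the resource identifiers are placeholders, and metric names should be checked against the Metrics blade of your APIM instance):
# Hourly request counts for the APIM instance
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ApiManagement/service/<apim-name>" \
  --metric "Requests" \
  --interval PT1H \
  --output table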

Best practices

  1. Set up alerts early: Configure alerts during development
  2. Monitor key metrics: Focus on business-critical metrics
  3. Create dashboards: Visualize important metrics
  4. Review logs regularly: Check logs for issues
  5. Track error trends: Monitor error rates over time
  6. Set up SLOs: Define service level objectives
  7. Document runbooks: Create troubleshooting guides
  8. Regular reviews: Review metrics and alerts periodically

Troubleshooting guide

Use these steps to diagnose and resolve common issues.

High error rate

  1. Check logs: Review error messages in DataDog/Grafana
  2. Identify pattern: Look for common error types
  3. Check external services: Verify external API health
  4. Review recent changes: Check recent deployments
  5. Scale resources: May need more resources

Slow performance

  1. Check response times: Identify slow operations
  2. Review resource usage: Check CPU and memory
  3. Analyze traces: Find bottlenecks in request flow
  4. Check external dependencies: External services may be slow
  5. Optimize code: Review and optimize slow operations

Memory issues

  1. Check memory usage: Monitor memory consumption
  2. Review heap dumps: Analyze memory usage patterns
  3. Check for leaks: Identify memory leaks
  4. Increase limits: Adjust memory limits if needed
  5. Optimize code: Reduce memory footprint

Connection issues

  1. Check network connectivity: Verify network connections
  2. Review connection pool: Check pool configuration
  3. Monitor connection errors: Track connection failures
  4. Check firewall rules: Verify firewall configurations
  5. Review DNS: Check DNS resolution
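Example connectivity checks from inside the cluster (a sketch; the hostname is a placeholder, and the commands assume the pod image includes nslookup and curl; otherwise attach a debug container with kubectl debug):
# DNS resolution from the connector pod
kubectl exec -n <namespace> <pod-name> -- nslookup api.external-service.example.com

# TLS/HTTP connectivity to the external dependency
kubectl exec -n <namespace> <pod-name> -- curl -sv https://api.external-service.example.com/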

Getting help

  • DataDog Support: Use DataDog’s help documentation
  • Grafana Support: Check Grafana documentation
  • Team Support: Contact your team lead for access issues
  • Platform Support: Create support ticket for platform issues