Overview
The platform provides comprehensive monitoring capabilities through:
- DataDog - Application performance monitoring, logs, and metrics
- Grafana - Dashboards, metrics visualization, and alerting
- Kubernetes - Container and cluster-level monitoring
- Azure APIM - API analytics and usage metrics
Accessing monitoring tools
Access monitoring tools to view logs, metrics, and traces for your connectors.
DataDog access
Prerequisites
To access DataDog, you need:
- Azure access (user must be in Azure AD)
- DataDog roles assigned in self-service.tfvars:
  - {env}-ro: Read-only access to logs and metrics
  - {env}-rw: Full access including creating dashboards, synthetics, and monitors
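As an illustration only, a role assignment in self-service.tfvars might look roughly like the following; the actual variable names and structure are defined by the platform's self-service repository, and the users, environments, and roles shown are placeholders.

```hcl
# Hypothetical sketch — actual variable names and structure come from the
# platform's self-service repository; users and roles below are placeholders.
datadog_role_assignments = {
  "jane.doe@example.com"   = ["dev-ro", "stg-ro"] # read-only in dev and stg
  "john.smith@example.com" = ["dev-rw"]           # full access in dev
}
```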
Accessing DataDog
- Navigate to: gc-ecos.datadoghq.eu
- First login: may fail with an HTTP 500 error while just-in-time provisioning completes
- Second login: should succeed
- Select your environment (dev, stg, test, uat)
Grafana access
Access Grafana for dashboards and metrics visualization.
Prerequisites
To access Grafana, you need:
- Microsoft Entra ID (Azure AD) account
- Grafana roles assigned in self-service.tfvars:
  - {env}-ro: Read-only access to logs and dashboards
  - {env}-rw: Full access including creating dashboards, monitors, and alerts
Accessing Grafana
- Navigate to: ecosecos.grafana.net
- Sign in with your Microsoft Entra ID credentials
- Select your workspace/environment
Monitoring connector health
Monitor your connector’s health using logs, metrics, and traces.
Application logs
Viewing logs in DataDog
- Navigate to Logs in DataDog
- Filter by service: Search for your connector name
- Filter by environment: Select dev, stg, test, or uat
- Time range: Select appropriate time window
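For example, a log search query along these lines (the service name is a placeholder, and the env facet assumes your logs are tagged by environment) narrows results to a single connector's errors:

```
service:my-connector env:dev status:error
```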
Viewing logs in Grafana
- Navigate to Explore in Grafana
- Select data source: Loki or appropriate log source
- Query logs: Use LogQL or similar query language
- Filter by labels: service, environment, level
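If Loki is the data source, a LogQL query might look like this; the label names are assumptions and depend on how logs are shipped:

```logql
# Select the connector's dev logs, then keep only lines containing "ERROR"
{service="my-connector", environment="dev"} |= "ERROR"
```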
Log levels
Monitor different log levels:
- ERROR: Critical issues requiring immediate attention
- WARN: Potential issues or unusual conditions
- INFO: General operational information
- DEBUG: Detailed diagnostic information
Application metrics
Track key performance and resource metrics for your connector.
Key metrics to monitor
Performance Metrics:
- Request rate: Requests per second
- Response time: P50, P95, P99 latencies
- Error rate: Percentage of failed requests
- Throughput: Messages processed per second
Resource Metrics:
- CPU utilization: Current CPU usage
- Memory usage: Memory consumption
- Network I/O: Network traffic
- Disk I/O: Disk read/write operations
Connector Metrics:
- Message processing rate: Messages processed successfully
- API call success rate: External API call success percentage
- Queue depth: Number of pending messages
- Processing time: Time to process each message
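If the connector exposes Prometheus-format metrics (see the /metrics endpoint under Health checks), queries along these lines can derive several of the metrics above. The metric names are conventional examples, not guaranteed to match your instrumentation:

```promql
# Request rate: requests per second over the last 5 minutes
sum(rate(http_requests_total{service="my-connector"}[5m]))

# Error rate: percentage of requests that returned a 5xx status
100 * sum(rate(http_requests_total{service="my-connector", status=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="my-connector"}[5m]))

# P95 response time from a latency histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{service="my-connector"}[5m])) by (le))
```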
Viewing metrics in DataDog
- Navigate to Metrics or Dashboards
- Search for metrics: Use metric names or tags
- Create custom dashboards: Visualize key metrics
- Set up monitors: Alert on threshold breaches
Viewing metrics in Grafana
- Navigate to Dashboards
- Select connector dashboard (if available)
- Create custom panels: Add new visualizations
- Configure alerts: Set up alerting rules
Distributed tracing
Trace requests as they flow through your connector and external services.
Request tracing
Trace requests across services:
- View request flow: See how requests traverse through services
- Identify bottlenecks: Find slow operations
- Error tracking: See where errors occur in the flow
Using DataDog APM
- Navigate to APM in DataDog
- Select your service: Find your connector
- View traces: See individual request traces
- Analyze performance: Identify slow endpoints
Health checks
Verify your connector is running and ready to handle requests.
Kubernetes health checks
Monitor connector health at the Kubernetes level using liveness and readiness probes (a configuration sketch follows below).
Application health endpoints
Monitor application-level health:
- Liveness endpoint: /health/live
- Readiness endpoint: /health/ready
- Metrics endpoint: /metrics (Prometheus format)
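As a sketch, the deployment's liveness and readiness probes can point at these endpoints; the container port and timings below are assumptions and should be adjusted to your deployment.

```yaml
# Sketch only — placed under the container spec in the Deployment;
# port and timings are assumptions.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```

Probe results and restart counts are visible via kubectl describe pod <pod-name>.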
Error tracking
Track and analyze errors to identify and resolve issues.
Error monitoring
In DataDog:
- Navigate to Error Tracking
- Filter by service: Your connector name
- View error trends: See error frequency over time
- Analyze stack traces: Understand error causes
Common error types:
- Connection errors: External API connectivity issues
- Timeout errors: Slow or unresponsive services
- Authentication errors: Invalid credentials or tokens
- Validation errors: Invalid data formats
- Resource errors: Memory or CPU exhaustion
Alerting
Configure alerts to notify you when issues occur.
Setting up alerts in DataDog
- Navigate to Monitors
- Create New Monitor
- Select metric: Choose metric to monitor
- Set thresholds: Define warning and critical thresholds
- Configure notifications: Set up notification channels
- Save monitor
Example alert conditions:
- Error rate > 5% for 5 minutes
- Response time P95 > 1000ms for 10 minutes
- CPU usage > 80% for 5 minutes
- Memory usage > 90% for 5 minutes
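As a rough sketch, the CPU condition above could be written in DataDog's monitor query form as follows; the metric and tag are assumptions (containerized connectors will typically use a different CPU metric):

```
avg(last_5m):avg:system.cpu.user{service:my-connector} > 80
```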
Setting up alerts in Grafana
- Navigate to Alerting
- Create Alert Rule
- Define query: Select metric and conditions
- Set evaluation interval: How often to check
- Configure notifications: Set up notification channels
- Save alert rule
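For a Prometheus-backed data source, the error-rate condition from the DataDog example could be expressed with a PromQL query such as the following (metric names are the same assumptions as earlier):

```promql
# Fires when more than 5% of requests failed over the last 5 minutes
100 * sum(rate(http_requests_total{service="my-connector", status=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="my-connector"}[5m])) > 5
```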
Dashboards
Create dashboards to visualize key metrics and monitor connector health.
Creating custom dashboards
In DataDog:
- Navigate to Dashboards
- Create New Dashboard
- Add widgets: Graphs, logs, metrics
- Configure queries: Set up metric queries
- Save dashboard
Recommended widgets:
- Request rate over time
- Error rate percentage
- Response time percentiles
- CPU and memory usage
- Active connections
- Queue depth
- Recent errors log
In Grafana:
- Navigate to Dashboards
- Create Dashboard
- Add panels: Time series, logs, stat panels
- Configure queries: Set up PromQL or similar
- Save dashboard
Performance analysis
Analyze performance data to identify and resolve bottlenecks.
Identifying performance issues
Slow Response Times:
- Check external API latency: External services may be slow
- Review database queries: Optimize slow queries
- Analyze thread pool usage: May need more threads
- Check resource constraints: CPU or memory limits
High Error Rates:
- Review error logs: Identify error patterns
- Check external service health: External APIs may be down
- Validate input data: Ensure data format is correct
- Review authentication: Check token expiration
High Resource Usage:
- Monitor memory usage: Check for memory leaks
- Review CPU usage: Optimize CPU-intensive operations
- Check connection pools: May need to increase pool size
- Review thread usage: Optimize thread pool configuration
Log archives
Access archived logs for long-term analysis and compliance.
Accessing log archives
For long-term log storage and analysis:
Prerequisites:
- Azure role: Log archive access role assigned
- Storage account access: Access to Azure Storage Account
- Key Vault access: For SPN credentials (automation)
Access methods (example commands are shown after this list):
- Azure Portal: Browse storage account
- Azure Storage Explorer: Desktop application
- Azure CLI: Command-line access
- AzCopy: High-performance copying
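As a sketch with the Azure CLI and AzCopy (storage account, container, and path names are placeholders), archived logs can be listed and downloaded like this:

```sh
# List archived log blobs (names below are placeholders)
az storage blob list \
  --account-name <storage-account> \
  --container-name <log-archive-container> \
  --prefix "connectors/my-connector/2024/" \
  --auth-mode login \
  --output table

# Download a range of archives (requires `azcopy login` or a SAS token)
azcopy copy \
  "https://<storage-account>.blob.core.windows.net/<log-archive-container>/connectors/my-connector/2024/" \
  "./archived-logs/" --recursive
```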
Use cases:
- Compliance: Long-term log retention
- Forensics: Investigating historical issues
- Analytics: Analyzing trends over time
- Auditing: Security and compliance audits
API management monitoring
Monitor API usage and performance through Azure API Management.
APIM analytics
Monitor API usage through Azure APIM.
Metrics Available:
- Request count: Total API requests
- Response time: API response times
- Error rate: Failed API calls
- Bandwidth: Data transfer volume
- Subscription usage: Per-subscription metrics
To view analytics:
- Navigate to Azure Portal
- Select APIM instance for your environment
- View Analytics: See usage and performance metrics
- Export reports: Generate usage reports
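Request counts can also be pulled with the Azure CLI; the resource ID is a placeholder and the metric name is an assumption based on standard APIM metrics:

```sh
# Total gateway requests per hour for the environment's APIM instance
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ApiManagement/service/<apim-name>" \
  --metric "Requests" \
  --interval PT1H \
  --output table
```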
Best practices
- Set up alerts early: Configure alerts during development
- Monitor key metrics: Focus on business-critical metrics
- Create dashboards: Visualize important metrics
- Review logs regularly: Check logs for issues
- Track error trends: Monitor error rates over time
- Set up SLOs: Define service level objectives
- Document runbooks: Create troubleshooting guides
- Regular reviews: Review metrics and alerts periodically
Troubleshooting guide
Use these steps to diagnose and resolve common issues.
High error rate
- Check logs: Review error messages in DataDog/Grafana
- Identify pattern: Look for common error types
- Check external services: Verify external API health
- Review recent changes: Check recent deployments
- Scale resources: May need more resources
Slow performance
- Check response times: Identify slow operations
- Review resource usage: Check CPU and memory
- Analyze traces: Find bottlenecks in request flow
- Check external dependencies: External services may be slow
- Optimize code: Review and optimize slow operations
Memory issues
- Check memory usage: Monitor memory consumption
- Review heap dumps: Analyze memory usage patterns
- Check for leaks: Identify memory leaks
- Increase limits: Adjust memory limits if needed
- Optimize code: Reduce memory footprint
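A quick way to confirm memory pressure at the Kubernetes level (pod and namespace names are placeholders):

```sh
# Current CPU/memory usage per pod (requires metrics-server)
kubectl top pod -n <namespace>

# Inspect limits, restart counts, and OOMKilled events for a pod
kubectl describe pod <pod-name> -n <namespace>
```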
Connection issues
- Check network connectivity: Verify network connections
- Review connection pool: Check pool configuration
- Monitor connection errors: Track connection failures
- Check firewall rules: Verify firewall configurations
- Review DNS: Check DNS resolution
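A quick connectivity and DNS check from inside the connector's pod (assumes the container image includes basic network tools; pod, namespace, and hostnames are placeholders):

```sh
# Resolve the external API's hostname from inside the pod
kubectl exec -it <pod-name> -n <namespace> -- nslookup api.example.com

# Check HTTPS reachability of the external endpoint
kubectl exec -it <pod-name> -n <namespace> -- curl -sv https://api.example.com/health
```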
Getting help
- DataDog Support: Use DataDog’s help documentation
- Grafana Support: Check Grafana documentation
- Team Support: Contact your team lead for access issues
- Platform Support: Create support ticket for platform issues