06.06. Monitoring Requirements

1. Centralized & Structured Logging

The backend must switch to JSON-structured logging to ensure consistency in Cloud Logging.

1.1 Specification: JSON Logging

1.2 Implementation: ✅ DONE

Implemented via app/core/logging_config.py using stdlib logging + custom CloudJsonFormatter (no structlog dependency needed). Context fields are propagated via ContextVar:

- Request context middleware in app/main.py extracts the X-Request-ID header
- TeslaFleetService.get_session() and authorize_and_fetch_vehicles() set session_id
- set_log_context() / clear_log_context() API for any code path
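As a rough illustration of the approach described above, the sketch below shows a stdlib-only JSON formatter that merges ContextVar fields into every record. It is not the contents of app/core/logging_config.py: the payload field names (severity, message, logger, timestamp), the _log_context variable, and the keyword-argument signature of set_log_context() are assumptions made for this example.

```python
import json
import logging
from contextvars import ContextVar
from datetime import datetime, timezone

# Hypothetical context store; the real module may organise this differently.
_log_context: ContextVar[dict] = ContextVar("log_context", default={})


def set_log_context(**fields) -> None:
    """Merge fields (e.g. request_id, session_id) into the current context."""
    # A new dict is created on every call so the shared default is never mutated.
    _log_context.set({**_log_context.get(), **fields})


def clear_log_context() -> None:
    """Drop all context fields, e.g. at the end of a request."""
    _log_context.set({})


class CloudJsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            # Context fields set via set_log_context() are merged in last.
            **_log_context.get(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def configure_logging(level: int = logging.INFO) -> None:
    """Route all root-logger output through the JSON formatter on stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(CloudJsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

With this shape, a middleware would call set_log_context(request_id=...) at the start of a request and clear_log_context() at the end, and every log line emitted in between would carry the same request_id without passing it through function arguments.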


2. Distributed Tracing (OpenTelemetry)

Tracing must span from the main API request to any background PDF generation tasks.

2.1 Specification: Trace Correlation

2.2 Implementation: ✅ DONE

Implemented via app/core/tracing.py using OpenTelemetry SDK with Google Cloud Trace exporter:

- Auto-instruments FastAPI (incoming requests), httpx (outgoing HTTP to Tesla/Stripe/PayPal), and Redis
- OTEL_ENABLED setting in app/config.py controls activation (default: disabled for local dev)
- Cloud Trace exporter when GCP_PROJECT is set; console exporter for local development
- Dependencies added to requirements.txt: opentelemetry-api, opentelemetry-sdk, opentelemetry-exporter-gcp-trace, and instrumentation packages
- GCP APIs monitoring.googleapis.com and cloudtrace.googleapis.com added to step 001 defaults
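The requirement that a trace spans from the API request to background PDF generation implies that trace context has to travel with the job. The page does not say how this is done; the sketch below shows one common option, carrying a W3C Trace Context traceparent value in the job payload. It uses only the standard library to make the header format visible. The TraceContext type, enqueue_pdf_job(), and the payload keys are hypothetical; in a codebase that already uses the OpenTelemetry SDK, the inject() and extract() helpers in opentelemetry.propagate would normally do this work.

```python
import re
import secrets
from typing import NamedTuple, Optional

# traceparent layout: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str   # 32 lowercase hex characters
    span_id: str    # 16 lowercase hex characters
    sampled: bool


def format_traceparent(ctx: TraceContext) -> str:
    """Serialise a context into a traceparent header value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent value; return None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid per the W3C Trace Context specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_context(parent: TraceContext) -> TraceContext:
    """Start a child span: same trace id, fresh random span id."""
    return parent._replace(span_id=secrets.token_hex(8))


def enqueue_pdf_job(report_id: str, ctx: TraceContext) -> dict:
    """Build a hypothetical job payload that carries the caller's trace context."""
    return {"report_id": report_id, "traceparent": format_traceparent(ctx)}
```

On the worker side, parse_traceparent(job["traceparent"]) followed by child_context() would give the PDF task a span that shares the trace id of the originating API request, which is what lets both appear in a single trace.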


3. Proactive Alerting (Cloud Monitoring)

Alerts must be created in Terraform to notify the team via Slack/Email.

3.1 Alert Specification: Tesla API Health

3.2 Alert Specification: Payment Failures

3.3 Alert Specification: Server 5xx Errors

3.4 Alert Specification: PDF Generation Failures

3.5 Implementation: ✅ DONE

Implemented via Terraform step 140-gcp-cloud-monitor:

- 05.log-based-metrics.tf: 4 custom log-based metrics (Tesla, payment, PDF, 5xx)
- 06.notification-channels.tf: email notification channel
- 07.alert-policies.tf: 4 alert policies with thresholds, auto-close, and runbook documentation
- Config in {env}.env.yaml under 140-gcp-cloud-monitor key (sender/recipient email, region)
- Template in bnc-cpt-cnf/src/tpl/%org%-%app%/%env%/tf/140-gcp-cloud-monitor.vars.tfvars.tpl
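To reason about what the four log-based metrics and their alert thresholds do, it can help to model them locally. The sketch below is a plain-Python approximation, not the Terraform configuration: the metric names, the component field, the severity rule, and the threshold values are all hypothetical, since the actual filters and thresholds are defined in 05.log-based-metrics.tf and 07.alert-policies.tf and are not reproduced on this page.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Optional

# Hypothetical mapping from a "component" log field to a metric name.
_COMPONENT_METRICS = {
    "tesla": "tesla_api_errors",
    "payment": "payment_failures",
    "pdf": "pdf_generation_failures",
}


def classify(entry: Mapping) -> Optional[str]:
    """Return the metric a structured log entry would count towards, if any."""
    status = (entry.get("httpRequest") or {}).get("status")
    # Any 5xx response counts as a server error, whatever component logged it.
    if isinstance(status, int) and 500 <= status <= 599:
        return "server_5xx"
    # Otherwise only error-level entries from a known component are counted.
    if entry.get("severity") not in {"ERROR", "CRITICAL"}:
        return None
    return _COMPONENT_METRICS.get(entry.get("component"))


def count_metrics(entries: Iterable[Mapping]) -> Counter:
    """Count matching entries per metric over one evaluation window."""
    return Counter(name for name in map(classify, entries) if name)


def breached(counts: Mapping[str, int], thresholds: Mapping[str, int]) -> List[str]:
    """List the metrics whose count meets or exceeds its threshold, sorted by name."""
    return sorted(
        name for name, limit in thresholds.items() if counts.get(name, 0) >= limit
    )
```

Feeding a window of parsed log entries through count_metrics() and then breached() gives a quick way to sanity-check a proposed threshold against historical logs before changing the alert policy.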


4. Dashboards (Google Cloud Monitoring)

Create a "Car Pulse Tracker - Health Dashboard" with:

- API Throughput: Requests/second per region.
- PDF Generation Latency: p95 and p99 for PDF rendering.
- Tesla API Error Rate: Percentage of failed outgoing calls to Tesla Fleet API.
- Payment Success Rate: Ratio of payment.verify successes to intents created.
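The dashboard quantities above reduce to a percentile, a percentage, and a ratio. The sketch below spells out that arithmetic in plain Python so the definitions are unambiguous. The function names are illustrative, and Cloud Monitoring derives percentiles from distribution buckets rather than raw samples, so its values will only approximate what this code computes.

```python
import statistics
from typing import Optional, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile (1..99) of at least two raw samples."""
    # quantiles(n=100) yields 99 cut points; cut point i sits at index i - 1.
    return statistics.quantiles(samples, n=100, method="inclusive")[pct - 1]


def error_rate_percent(failed_calls: int, total_calls: int) -> float:
    """Percentage of outgoing calls that failed; 0.0 when there was no traffic."""
    if total_calls == 0:
        return 0.0
    return 100.0 * failed_calls / total_calls


def payment_success_rate(verified: int, intents_created: int) -> Optional[float]:
    """Ratio of verified payments to created intents; None when no intents exist."""
    if intents_created == 0:
        return None
    return verified / intents_created
```

Returning None for the payment ratio when no intents were created keeps an idle period from showing up as a 0% success rate, which would otherwise look like an outage on the panel.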

4.1 Implementation: ✅ DONE

Implemented via Terraform step 140-gcp-cloud-monitor in 08.dashboard.tf:

- 10-panel mosaic dashboard with 4 rows:
  - Row 1: API Request Rate (req/s) + API Latency p95/p99
  - Row 2: Tesla API Errors + Payment Errors + PDF Generation Errors
  - Row 3: Container CPU + Memory Utilization + 5xx Errors
  - Row 4: Active Instances + Container Startup Latency
- Uses Cloud Run built-in metrics + custom log-based metrics
- Dashboard auto-provisioned per environment via Terraform