06.06. Monitoring Requirements

1. Centralized & Structured Logging

The backend must switch to JSON-structured logging to ensure consistency in Cloud Logging.

1.1 Specification: JSON Logging

Log Format: All logs must be output as JSON to stdout.
Fields Required:
- message: The main log message.
- severity: Standard level (INFO, WARNING, ERROR, CRITICAL).
- session_id: Persisted from the user's OrderSession if available.
- payment_id: Persisted if in the payment flow.
- vin_suffix: Last 4 digits of the VIN (masked for privacy).
- request_id: From the X-Request-ID header for trace correlation.

1.2 Implementation: ✅ DONE

Implemented via app/core/logging_config.py using stdlib logging + custom CloudJsonFormatter (no structlog dependency needed). Context fields propagated via ContextVar:
- Request context middleware in app/main.py extracts X-Request-ID header
- TeslaFleetService.get_session() and authorize_and_fetch_vehicles() set session_id
- set_log_context() / clear_log_context() API for any code path
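For orientation, the formatter and context API named above could look roughly like the sketch below. This is not the contents of app/core/logging_config.py. Only the names CloudJsonFormatter, set_log_context() and clear_log_context() come from this page; the internals, the vin_suffix() helper and configure_logging() are assumptions made for illustration.

```python
import json
import logging
import sys
from contextvars import ContextVar

# Per-request context fields from the specification (section 1.1).
_CONTEXT_FIELDS = ("session_id", "payment_id", "vin_suffix", "request_id")
_LOG_CONTEXT: ContextVar[dict] = ContextVar("log_context", default={})


def set_log_context(**fields):
    """Merge fields into the current context without mutating the old dict."""
    _LOG_CONTEXT.set({**_LOG_CONTEXT.get(), **fields})


def clear_log_context():
    """Drop all context fields, e.g. at the end of a request."""
    _LOG_CONTEXT.set({})


def vin_suffix(vin):
    """Hypothetical helper: keep only the last 4 characters of a VIN."""
    return vin[-4:]


class CloudJsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "message": record.getMessage(),
            # Python level names (INFO, WARNING, ERROR, CRITICAL) match
            # the severity values listed in the specification.
            "severity": record.levelname,
        }
        context = _LOG_CONTEXT.get()
        for name in _CONTEXT_FIELDS:
            if name in context:
                payload[name] = context[name]
        return json.dumps(payload)


def configure_logging(stream=sys.stdout):
    """Hypothetical setup: send all root-logger output through the formatter."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(CloudJsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(logging.INFO)
```

With this sketch, calling set_log_context(request_id="req-1") before logging.getLogger(__name__).info("order created") would emit one JSON line containing message, severity and request_id. Fields that were never set are omitted rather than written as null.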

2. Distributed Tracing (OpenTelemetry)

Tracing must span from the main API request to any background PDF generation tasks.

2.1 Specification: Trace Correlation

Span Name: Clear name for major operations (e.g., TeslaFleetAPI.get_vehicle_data, PDFService.render_report).
Context Propagation: Pass the traceparent header to background workers (Cloud Tasks).
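To make the propagation requirement concrete, the sketch below builds and parses a traceparent value in the W3C Trace Context version 00 layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) using only the standard library. In the service itself the OpenTelemetry propagation API would normally write this header. The function names and the shape of the task header dictionary are assumptions for illustration, not the project's code.

```python
import re

# W3C Trace Context, version 00: "00-<trace-id>-<parent-id>-<flags>",
# all lowercase hexadecimal.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def build_traceparent(trace_id, span_id, sampled=True):
    """Format a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the parsed fields, or None if the header is not usable.

    Only the version 00 layout is handled in this sketch.
    """
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version != "00":
        return None
    # All-zero trace or span IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def task_headers_with_trace(traceparent, extra=None):
    """Hypothetical helper: headers to attach to a background task request."""
    headers = dict(extra or {})
    headers["traceparent"] = traceparent
    return headers
```

The background worker would read the same header from the incoming task request and continue the trace under the parsed trace_id, so the PDF rendering span appears as a child of the originating API request.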

2.2 Implementation: ✅ DONE

Implemented via app/core/tracing.py using OpenTelemetry SDK with Google Cloud Trace exporter:
- Auto-instruments FastAPI (incoming requests), httpx (outgoing HTTP to Tesla/Stripe/PayPal), and Redis
- OTEL_ENABLED setting in app/config.py controls activation (default: disabled for local dev)
- Cloud Trace exporter when GCP_PROJECT is set; console exporter for local development
- Dependencies added to requirements.txt: opentelemetry-api, opentelemetry-sdk, opentelemetry-exporter-gcp-trace, and instrumentation packages
- GCP APIs monitoring.googleapis.com and cloudtrace.googleapis.com added to step 001 defaults
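The activation rules above reduce to a small decision, sketched here as a pure function. The function name and the returned labels are hypothetical; the real app/core/tracing.py presumably constructs exporter objects rather than returning strings.

```python
from typing import Optional


def select_trace_exporter(otel_enabled: bool, gcp_project: Optional[str]) -> str:
    """Decide which exporter to use, following the rules described above."""
    if not otel_enabled:
        # OTEL_ENABLED defaults to disabled for local development.
        return "disabled"
    if gcp_project:
        # GCP_PROJECT is set: export spans to Cloud Trace.
        return "cloud_trace"
    # Tracing enabled without a project: print spans to the console.
    return "console"
```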

3. Proactive Alerting (Cloud Monitoring)

Alerts must be created in Terraform to notify the team via Slack/Email.

3.1 Alert Specification: Tesla API Health

Metric: logging.googleapis.com/user/${org}-${app}-${env}-tesla-api-errors
Condition: Count > 10 in 5 minutes (where log contains Tesla error indicators: 401, 403, 429, 5xx).
Notification: Email via Cloud Monitoring notification channel.

3.2 Alert Specification: Payment Failures

Metric: logging.googleapis.com/user/${org}-${app}-${env}-payment-errors
Condition: Count > 5 in 15 minutes.
Criticality: CRITICAL (direct revenue impact).

3.3 Alert Specification: Server 5xx Errors

Metric: logging.googleapis.com/user/${org}-${app}-${env}-server-errors-5xx
Condition: Count > 10 in 5 minutes.
Notification: Email via Cloud Monitoring notification channel.

3.4 Alert Specification: PDF Generation Failures

Metric: logging.googleapis.com/user/${org}-${app}-${env}-pdf-errors
Condition: Count > 5 in 15 minutes.
Notification: Email via Cloud Monitoring notification channel.
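The four conditions above share one shape: fire when the count of matching log entries within a time window exceeds a threshold. The sketch below restates them as data and evaluates them with a simple sliding window. It only illustrates the thresholds; Cloud Monitoring evaluates the real policies from the log-based metrics using its own alignment periods, and the class and function names here are assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: int       # alert fires when count is strictly greater
    window_seconds: int


# Thresholds taken from sections 3.1 to 3.4.
RULES = {
    "tesla-api-errors": AlertRule("tesla-api-errors", 10, 5 * 60),
    "payment-errors": AlertRule("payment-errors", 5, 15 * 60),
    "server-errors-5xx": AlertRule("server-errors-5xx", 10, 5 * 60),
    "pdf-errors": AlertRule("pdf-errors", 5, 15 * 60),
}


def is_tesla_error_status(status):
    """Tesla error indicators from 3.1: 401, 403, 429, or any 5xx."""
    return status in (401, 403, 429) or 500 <= status <= 599


class WindowCounter:
    """Counts events inside a rule's window; timestamps must be ascending."""

    def __init__(self, rule):
        self.rule = rule
        self._events = deque()

    def record(self, timestamp):
        self._events.append(timestamp)

    def is_firing(self, now):
        # Discard events that have fallen out of the window.
        cutoff = now - self.rule.window_seconds
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.rule.threshold
```

For example, five payment errors within 15 minutes do not fire the payment rule, while a sixth does, because the condition is "greater than 5" rather than "at least 5".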

3.5 Implementation: ✅ DONE

Implemented via Terraform step 140-gcp-cloud-monitor:
- 05.log-based-metrics.tf: 4 custom log-based metrics (Tesla, payment, PDF, 5xx)
- 06.notification-channels.tf: email notification channel
- 07.alert-policies.tf: 4 alert policies with thresholds, auto-close, and runbook documentation
- Config in {env}.env.yaml under 140-gcp-cloud-monitor key (sender/recipient email, region)
- Template in bnc-cpt-cnf/src/tpl/%org%-%app%/%env%/tf/140-gcp-cloud-monitor.vars.tfvars.tpl

4. Dashboards (Google Cloud Monitoring)

Create a "Car Pulse Tracker - Health Dashboard" with:
- API Throughput: Requests/second per region.
- PDF Generation Latency: p95 and p99 for PDF rendering.
- Tesla API Error Rate: Percentage of failed outgoing calls to Tesla Fleet API.
- Payment Success Rate: Ratio of payment.verify successes to intents created.
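The definitions behind these panels can be written out as small functions. This is only a sketch of what each number means, with hypothetical function names; the dashboard itself derives its values from Cloud Monitoring metrics, where percentiles are estimated from distribution buckets rather than computed over raw samples as done here.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of raw samples (pct is an integer 1..100)."""
    if not values:
        raise ValueError("percentile of empty sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def error_rate_percent(failed_calls, total_calls):
    """Tesla API error rate: failed outgoing calls as a percentage."""
    if total_calls == 0:
        return 0.0
    return 100.0 * failed_calls / total_calls


def payment_success_rate(verify_successes, intents_created):
    """Ratio of payment.verify successes to payment intents created."""
    if intents_created == 0:
        return 0.0
    return verify_successes / intents_created
```

As an example, 3 failed calls out of 60 give an error rate of 5.0 percent, and 45 verified payments out of 50 intents give a success rate of 0.9.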

4.1 Implementation: ✅ DONE

Implemented via Terraform step 140-gcp-cloud-monitor in 08.dashboard.tf:
- 10-panel mosaic dashboard with 4 rows:
  - Row 1: API Request Rate (req/s) + API Latency p95/p99
  - Row 2: Tesla API Errors + Payment Errors + PDF Generation Errors
  - Row 3: Container CPU + Memory Utilization + 5xx Errors
  - Row 4: Active Instances + Container Startup Latency
- Uses Cloud Run built-in metrics + custom log-based metrics
- Dashboard auto-provisioned per environment via Terraform