Startup Cleanup for Stale Sync Progress Records¶

Overview¶

On startup, the application automatically runs a cleanup task that finds sync progress records stuck in "in_progress" status (because of killed tasks, instance restarts, or other failures) and marks them as failed.

How It Works¶

Startup Process¶

1. App Starts: FastAPI lifespan context manager executes
2. Firestore Watchers: Initialize database watchers
3. Startup Cleanup: Run stale sync cleanup (if enabled)
4. Local Environment Setup: Configure local databases (if applicable)
5. App Ready: Application is ready to serve requests
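
The ordering above can be sketched as an async context manager of the kind FastAPI accepts through its `lifespan` argument. This is a minimal illustration, not the application's actual code: the three step functions, the `events` list, and the `cleanup_enabled` / `is_local` flags are hypothetical stand-ins, and the real steps presumably read their settings from configuration.

```python
import asyncio
from contextlib import asynccontextmanager

# Hypothetical stand-ins for the real startup steps; they only record their order.
events: list[str] = []


async def init_firestore_watchers() -> None:
    events.append("watchers")


async def run_startup_cleanup() -> None:
    events.append("cleanup")


async def setup_local_environment() -> None:
    events.append("local_env")


@asynccontextmanager
async def lifespan(app, cleanup_enabled: bool = True, is_local: bool = False):
    """Run the startup steps in the documented order, then hand control to the app."""
    await init_firestore_watchers()
    if cleanup_enabled:          # step 3 runs only when cleanup is enabled
        await run_startup_cleanup()
    if is_local:                 # step 4 applies only to local environments
        await setup_local_environment()
    events.append("ready")
    yield                        # the application serves requests while suspended here
    events.append("shutdown")
```

With FastAPI installed, a function like this would be passed as `FastAPI(lifespan=lifespan)`; the framework calls it with the app instance as the only argument, so the extra flags fall back to their defaults.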

Detection Logic¶

The startup cleanup uses three detection methods:

1. Age-Based Detection (default: 30 minutes)
   - Tasks running longer than the threshold are considered stale
   - Configurable via startup_cleanup_stale_threshold_minutes
2. Heartbeat-Based Detection (default: 10 minutes on startup)
   - Tasks with no last_activity_at updates beyond the threshold
   - More aggressive than runtime cleanup (10 minutes vs. 15 minutes)
   - Configurable via startup_cleanup_heartbeat_threshold_minutes
3. Process-Based Detection
   - Tasks where the worker PID no longer exists on the same hostname
   - Only applies to tasks running on the same server
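
As a rough illustration of how the three checks could combine, the sketch below classifies a single record. The `SyncRecord` shape, the function names, the order in which the checks run, and the reason strings other than `stale_by_heartbeat` are assumptions made for the example. The PID check relies on `os.kill(pid, 0)`, which only tests for existence on POSIX systems.

```python
import os
import socket
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SyncRecord:
    """Hypothetical shape of an in-progress sync record."""
    started_at: datetime
    last_activity_at: Optional[datetime] = None
    worker_hostname: Optional[str] = None
    worker_pid: Optional[int] = None


def pid_exists(pid: int) -> bool:
    """POSIX-only: signal 0 checks that a process exists without signalling it."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stale_reason(
    record: SyncRecord,
    now: Optional[datetime] = None,
    stale_minutes: int = 30,
    heartbeat_minutes: int = 10,
    local_hostname: Optional[str] = None,
) -> Optional[str]:
    """Return why a record looks stale, or None if it still looks healthy."""
    now = now or datetime.now(timezone.utc)
    # Heartbeat check: no last_activity_at update within the threshold.
    if (record.last_activity_at is not None
            and now - record.last_activity_at > timedelta(minutes=heartbeat_minutes)):
        return "stale_by_heartbeat"
    # Age check: the task has been running longer than the threshold.
    if now - record.started_at > timedelta(minutes=stale_minutes):
        return "stale_by_age"  # assumed reason string
    # Process check: only meaningful for workers on this same host.
    host = local_hostname or socket.gethostname()
    if (record.worker_pid is not None
            and record.worker_hostname == host
            and not pid_exists(record.worker_pid)):
        return "stale_by_dead_process"  # assumed reason string
    return None
```

Checking the heartbeat first is a choice made here so that a long-running but silent task is reported with the more specific reason.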

Cleanup Actions¶

When stale syncs are detected:
- Status changed from "in_progress" → "failed"
- Detailed error message explaining the cleanup reason
- Comprehensive metadata added to details field:

{
  "cleanup_reason": "stale_by_heartbeat",
  "cleanup_time": "2024-01-01T10:20:00Z",
  "original_worker_hostname": "worker-pod-abc",
  "original_worker_pid": 1234,
  "task_age_minutes": 45.2,
  "last_activity_age_minutes": 25.0
}
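
A payload like the one above could be assembled as follows. This is a sketch under assumptions: the record is modeled as a plain dict with `started_at`, `last_activity_at`, `worker_hostname`, and `worker_pid` keys, and the `error_message` field name and wording are invented for the example. Only the keys inside `details` come from the sample metadata.

```python
from datetime import datetime, timezone


def build_failed_update(record: dict, reason: str, now: datetime) -> dict:
    """Build the fields to write when a stale sync is marked as failed."""

    def age_minutes(timestamp: datetime) -> float:
        # Elapsed minutes since the timestamp, rounded to one decimal place.
        return round((now - timestamp).total_seconds() / 60, 1)

    details = {
        "cleanup_reason": reason,
        "cleanup_time": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "original_worker_hostname": record.get("worker_hostname"),
        "original_worker_pid": record.get("worker_pid"),
        "task_age_minutes": age_minutes(record["started_at"]),
        "last_activity_age_minutes": age_minutes(record["last_activity_at"]),
    }
    return {
        "status": "failed",
        # Hypothetical field name and wording for the error message.
        "error_message": f"Marked as failed by startup cleanup ({reason})",
        "details": details,
    }
```

Passing `now` in explicitly keeps the two age values and `cleanup_time` consistent with each other and makes the function easy to test.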

Configuration¶

Environment Variables¶

# Enable/disable startup cleanup (default: true)
STARTUP_CLEANUP_ENABLED=true

# Age threshold for stale tasks in minutes (default: 30)
STARTUP_CLEANUP_STALE_THRESHOLD_MINUTES=30

# Heartbeat threshold in minutes (default: 10, more aggressive than runtime)
STARTUP_CLEANUP_HEARTBEAT_THRESHOLD_MINUTES=10

Configuration Object¶

from application.config import app_config

# Check if startup cleanup is enabled
if app_config.startup_cleanup_enabled:
    print(f"Cleanup thresholds: {app_config.startup_cleanup_stale_threshold_minutes}m age, "
          f"{app_config.startup_cleanup_heartbeat_threshold_minutes}m heartbeat")
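
For readers wondering how the environment variables might map onto attributes like these, here is a minimal, standard-library sketch. The `StartupCleanupSettings` class and `load_settings` function are hypothetical and are not the real `app_config` implementation; the accepted truthy strings are also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StartupCleanupSettings:
    """Hypothetical container mirroring the three documented settings."""
    enabled: bool = True
    stale_threshold_minutes: int = 30
    heartbeat_threshold_minutes: int = 10


def load_settings(env: Mapping[str, str] = os.environ) -> StartupCleanupSettings:
    """Read the documented variables, falling back to the documented defaults."""
    raw_enabled = env.get("STARTUP_CLEANUP_ENABLED", "true")
    return StartupCleanupSettings(
        # Assumed convention: these strings count as "enabled", anything else as disabled.
        enabled=raw_enabled.strip().lower() in {"1", "true", "yes", "on"},
        stale_threshold_minutes=int(
            env.get("STARTUP_CLEANUP_STALE_THRESHOLD_MINUTES", "30")),
        heartbeat_threshold_minutes=int(
            env.get("STARTUP_CLEANUP_HEARTBEAT_THRESHOLD_MINUTES", "10")),
    )
```

Accepting any mapping rather than reading `os.environ` directly makes the defaults easy to check in tests.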

Logging¶

Successful Cleanup¶

INFO: Running startup cleanup of stale sync progress records...
INFO: Startup cleanup completed: cleaned 3 stale sync records

No Stale Records¶

INFO: Running startup cleanup of stale sync progress records...
INFO: Startup cleanup completed: no stale sync records found

Disabled Cleanup¶

INFO: Startup cleanup disabled by configuration

Failed Cleanup (Non-Fatal)¶

INFO: Running startup cleanup of stale sync progress records...
ERROR: Startup cleanup failed (continuing anyway): Database connection timeout
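
The four log outcomes above suggest a wrapper along these lines. It is a sketch, not the actual implementation: the function name, the logger name, and the `cleanup` callable (assumed to return the number of records cleaned) are hypothetical. The message texts are copied from the log samples.

```python
import logging
from typing import Callable

logger = logging.getLogger("startup_cleanup")  # hypothetical logger name


def run_startup_cleanup(enabled: bool, cleanup: Callable[[], int]) -> int:
    """Run the cleanup, log the outcome, and never raise to the caller."""
    if not enabled:
        logger.info("Startup cleanup disabled by configuration")
        return 0
    logger.info("Running startup cleanup of stale sync progress records...")
    try:
        cleaned = cleanup()
    except Exception as exc:  # deliberately broad: startup must continue regardless
        logger.error("Startup cleanup failed (continuing anyway): %s", exc)
        return 0
    if cleaned:
        logger.info("Startup cleanup completed: cleaned %d stale sync records", cleaned)
    else:
        logger.info("Startup cleanup completed: no stale sync records found")
    return cleaned
```

Catching the broad `Exception` is what makes the failure non-fatal; the error is still logged at ERROR level so it remains visible in startup logs.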

Operational Benefits¶

Deployment Recovery¶

Problem: After deployments, some tasks may be killed mid-execution
Solution: Startup cleanup marks them as failed, allowing retries

Instance Restart Recovery¶

Problem: Cloud Run instances restart, leaving orphaned in-progress records
Solution: Next instance startup cleans up previous instance's stale tasks

Database Consistency¶

Problem: Stuck "in_progress" records prevent new syncs and cause confusion
Solution: Every startup marks stale records as failed, so new syncs are no longer blocked and the stored status reflects what is actually running

Monitoring & Alerting¶

Benefit: Clear distinction between active tasks and failed tasks
Metrics: Startup cleanup stats help identify infrastructure issues

Safety Features¶

Non-Blocking Startup¶

Cleanup failures don't prevent app startup
Errors are logged but application continues normally
Essential for production reliability

Conservative Thresholds¶

Default 30-minute age threshold prevents premature cleanup
Configurable thresholds allow environment-specific tuning
Multiple detection methods reduce false positives

Comprehensive Audit Trail¶

All cleanup actions are fully logged
Original worker information preserved
Detailed timing and reason metadata
Enables post-mortem analysis
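
For post-mortem analysis, the preserved metadata can be aggregated with a few lines of Python. The sketch below assumes the cleaned-up records have already been fetched as dicts carrying the `details` field described under Cleanup Actions; the function name and the `"unknown"` fallback are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_cleanup_reasons(records: Iterable[Mapping]) -> Counter:
    """Count cleaned-up records per cleanup_reason found in their details field."""
    reasons: Counter = Counter()
    for record in records:
        details = record.get("details") or {}
        # Records without cleanup metadata are grouped under "unknown".
        reasons[details.get("cleanup_reason", "unknown")] += 1
    return reasons
```

A sudden rise in one reason across deployments can hint at where to look first, for example worker stability versus heartbeat frequency.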

Troubleshooting¶

High Cleanup Counts¶

If startup cleanup consistently finds many stale records:
- Check: Worker instance stability
- Consider: Reducing task complexity or increasing timeouts
- Monitor: Worker memory and CPU usage

Cleanup Failures¶

If startup cleanup fails:
- Check: Database connectivity
- Verify: Firestore permissions
- Monitor: Application startup logs

False Positives¶

If healthy tasks are being cleaned up:
- Increase: startup_cleanup_stale_threshold_minutes
- Check: Task heartbeat frequency
- Verify: System clock synchronization

Related Features¶

Runtime Cleanup: Scheduled cleanup during normal operation
API Endpoints: Manual cleanup via /api/v1/sync-progress/cleanup
Standalone Script: src/bin/cleanup_stale_syncs.py for cron jobs
Internal Endpoint: /internal/email/cleanup-stale-syncs for Cloud Scheduler

Robust SyncProgress Handling (June 2025)¶

The transaction sync system now mirrors the email system's robust SyncProgress handling:
- SyncProgress is updated at every stage (pending, in_progress, completed, failed), with detailed context and error information.
- Failed syncs are automatically reset to 'pending' for retry, ensuring reliability and idempotency.
- All orchestrator and worker code now uses the same update pattern as the email system, with structured logging and error handling.
- Tests have been updated to assert on sync progress updates, ensuring regression safety and business-driven coverage.
- This closes the gap with the email system for progress tracking, error recovery, and user experience.
- All relevant code and tests now pass, confirming the system is robust and production-ready.
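
The four stages and the retry reset can be pictured as a small state machine. The sketch below is an assumption about which transitions are legal, inferred from the description above; the `SyncStatus` enum, the `ALLOWED_TRANSITIONS` table, and the `transition` helper are hypothetical names, not the project's actual code.

```python
from enum import Enum


class SyncStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition table: failed syncs go back to pending for retry,
# and completed is treated as terminal.
ALLOWED_TRANSITIONS = {
    SyncStatus.PENDING: {SyncStatus.IN_PROGRESS},
    SyncStatus.IN_PROGRESS: {SyncStatus.COMPLETED, SyncStatus.FAILED},
    SyncStatus.FAILED: {SyncStatus.PENDING},
    SyncStatus.COMPLETED: set(),
}


def transition(current: SyncStatus, target: SyncStatus) -> SyncStatus:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    return target
```

Under this model, the startup cleanup described earlier corresponds to the `in_progress` to `failed` move, after which the retry reset returns the record to `pending`.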