Startup Cleanup for Stale Sync Progress Records

Overview

On startup, the application automatically runs a cleanup task that finds sync progress records stuck in "in_progress" status (because of killed tasks, instance restarts, or other failures) and marks them as failed.

How It Works

Startup Process

  1. App Starts: FastAPI lifespan context manager executes
  2. Firestore Watchers: Initialize database watchers
  3. Startup Cleanup: Run stale sync cleanup (if enabled)
  4. Local Environment Setup: Configure local databases (if applicable)
  5. App Ready: Application is ready to serve requests
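
The startup sequence above can be sketched as a lifespan-style async context manager. This is only an illustration under stated assumptions: `make_lifespan`, `start_watchers`, `run_cleanup`, and `setup_local_env` are hypothetical names, not the application's actual functions, and the sketch uses only the standard library so it does not depend on FastAPI itself.

```python
import logging
from contextlib import asynccontextmanager

logger = logging.getLogger("startup_cleanup")


def make_lifespan(start_watchers, run_cleanup, cleanup_enabled=True,
                  setup_local_env=None):
    """Build a lifespan context manager (hypothetical helper).

    start_watchers, run_cleanup and setup_local_env are async callables
    supplied by the caller; run_cleanup returns the number of records cleaned.
    """

    @asynccontextmanager
    async def lifespan(app):
        # Step 2: initialize database watchers.
        await start_watchers()

        # Step 3: run the stale sync cleanup, if enabled.
        if cleanup_enabled:
            logger.info("Running startup cleanup of stale sync progress records...")
            try:
                cleaned = await run_cleanup()
                if cleaned:
                    logger.info("Startup cleanup completed: cleaned %d stale sync records", cleaned)
                else:
                    logger.info("Startup cleanup completed: no stale sync records found")
            except Exception as exc:
                # A cleanup failure is logged but must not block startup.
                logger.error("Startup cleanup failed (continuing anyway): %s", exc)
        else:
            logger.info("Startup cleanup disabled by configuration")

        # Step 4: optional local environment setup.
        if setup_local_env is not None:
            await setup_local_env()

        # Step 5: the application serves requests while suspended here.
        yield

    return lifespan
```

A function of this shape takes the app object as its only argument, which is the form a lifespan handler is expected to have. The broad `except Exception` is deliberate in this sketch: it reflects the non-fatal behavior described in this document rather than a general recommendation.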

Detection Logic

The startup cleanup uses three detection methods:

  1. Age-Based Detection (default: 30 minutes)
     • Tasks running longer than the threshold are considered stale
     • Configurable via startup_cleanup_stale_threshold_minutes

  2. Heartbeat-Based Detection (default: 10 minutes on startup)
     • Tasks whose last_activity_at has not been updated within the threshold are considered stale
     • More aggressive than runtime cleanup (10 minutes vs. 15 minutes)
     • Configurable via startup_cleanup_heartbeat_threshold_minutes

  3. Process-Based Detection
     • Tasks whose worker PID no longer exists on the same hostname are considered stale
     • Only applies to tasks that ran on the same server
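
A minimal sketch of how the three detection methods might be combined is shown below. The record keys (`started_at`, `last_activity_at`, `worker_hostname`, `worker_pid`), the function names, and the reason strings other than "stale_by_heartbeat" are assumptions made for illustration. The order in which the checks are tried is also an assumption, and the PID probe relies on POSIX semantics of signal 0.

```python
import os
import socket
from datetime import datetime, timedelta, timezone


def pid_exists(pid):
    """Best-effort POSIX check: signal 0 probes a PID without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stale_reason(record, now, stale_minutes=30, heartbeat_minutes=10,
                 hostname=None, pid_check=pid_exists):
    """Return a reason string if an in-progress record looks stale, else None.

    `record` is a dict using hypothetical key names; timestamps are
    timezone-aware datetimes.
    """
    # Heartbeat-based: no activity update within the threshold.
    # Records without a heartbeat timestamp are skipped here (an assumption).
    last_activity = record.get("last_activity_at")
    if last_activity is not None and now - last_activity > timedelta(minutes=heartbeat_minutes):
        return "stale_by_heartbeat"

    # Age-based: the task has been running longer than the threshold.
    if now - record["started_at"] > timedelta(minutes=stale_minutes):
        return "stale_by_age"  # assumed reason name

    # Process-based: only meaningful for tasks started on this host.
    hostname = hostname or socket.gethostname()
    pid = record.get("worker_pid")
    if record.get("worker_hostname") == hostname and pid is not None and not pid_check(pid):
        return "stale_by_dead_process"  # assumed reason name

    return None
```

Passing `pid_check` and `hostname` as parameters keeps the function easy to test without touching real processes.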

Cleanup Actions

When stale syncs are detected:

  • Status is changed from "in_progress" to "failed"
  • A detailed error message explains the cleanup reason
  • Comprehensive metadata is added to the details field:

{
  "cleanup_reason": "stale_by_heartbeat",
  "cleanup_time": "2024-01-01T10:20:00Z",
  "original_worker_hostname": "worker-pod-abc",
  "original_worker_pid": 1234,
  "task_age_minutes": 45.2,
  "last_activity_age_minutes": 25.0
}
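
As a rough illustration, metadata of the shape shown above could be assembled as follows. The function names, the record keys (`started_at`, `last_activity_at`, `worker_hostname`, `worker_pid`), and the `error` field are hypothetical. Only the keys inside `details` are taken from the example above.

```python
from datetime import datetime, timedelta, timezone


def build_cleanup_details(record, reason, now):
    """Build the details metadata for a cleaned-up record.

    `now` and the record timestamps are assumed to be UTC-aware datetimes.
    """
    def age_minutes(ts):
        if ts is None:
            return None
        return round((now - ts).total_seconds() / 60, 1)

    return {
        "cleanup_reason": reason,
        "cleanup_time": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "original_worker_hostname": record.get("worker_hostname"),
        "original_worker_pid": record.get("worker_pid"),
        "task_age_minutes": age_minutes(record["started_at"]),
        "last_activity_age_minutes": age_minutes(record.get("last_activity_at")),
    }


def mark_failed(record, reason, now):
    """Return a copy of the record marked as failed, with cleanup metadata."""
    updated = dict(record)
    updated["status"] = "failed"
    # The error field name and message wording are assumptions.
    updated["error"] = f"Marked as failed by startup cleanup ({reason})"
    updated["details"] = build_cleanup_details(record, reason, now)
    return updated
```

Returning a copy rather than mutating the input keeps the original record available for logging before the update is written back.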

Configuration

Environment Variables

# Enable/disable startup cleanup (default: true)
STARTUP_CLEANUP_ENABLED=true

# Age threshold for stale tasks in minutes (default: 30)
STARTUP_CLEANUP_STALE_THRESHOLD_MINUTES=30

# Heartbeat threshold in minutes (default: 10, more aggressive than runtime)
STARTUP_CLEANUP_HEARTBEAT_THRESHOLD_MINUTES=10
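
For illustration, the variables above might be read into a settings object along these lines. `StartupCleanupSettings` and `load_settings` are hypothetical names, and the set of strings accepted as "true" is an assumption. The application's real parsing may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupCleanupSettings:
    enabled: bool = True
    stale_threshold_minutes: int = 30
    heartbeat_threshold_minutes: int = 10


def load_settings(env=None):
    """Read startup cleanup settings from an environment mapping."""
    env = os.environ if env is None else env
    raw_enabled = env.get("STARTUP_CLEANUP_ENABLED", "true")
    return StartupCleanupSettings(
        # Accepted truthy spellings are an assumption of this sketch.
        enabled=raw_enabled.strip().lower() in {"1", "true", "yes", "on"},
        stale_threshold_minutes=int(
            env.get("STARTUP_CLEANUP_STALE_THRESHOLD_MINUTES", "30")),
        heartbeat_threshold_minutes=int(
            env.get("STARTUP_CLEANUP_HEARTBEAT_THRESHOLD_MINUTES", "10")),
    )
```

Accepting an explicit mapping instead of always reading `os.environ` makes the defaults easy to verify in tests.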

Configuration Object

from application.config import app_config

# Check if startup cleanup is enabled
if app_config.startup_cleanup_enabled:
    print(f"Cleanup thresholds: {app_config.startup_cleanup_stale_threshold_minutes}m age, "
          f"{app_config.startup_cleanup_heartbeat_threshold_minutes}m heartbeat")

Logging

Successful Cleanup

INFO: Running startup cleanup of stale sync progress records...
INFO: Startup cleanup completed: cleaned 3 stale sync records

No Stale Records

INFO: Running startup cleanup of stale sync progress records...
INFO: Startup cleanup completed: no stale sync records found

Disabled Cleanup

INFO: Startup cleanup disabled by configuration

Failed Cleanup (Non-Fatal)

INFO: Running startup cleanup of stale sync progress records...
ERROR: Startup cleanup failed (continuing anyway): Database connection timeout

Operational Benefits

Deployment Recovery

  • Problem: After deployments, some tasks may be killed mid-execution
  • Solution: Startup cleanup marks them as failed, allowing retries

Instance Restart Recovery

  • Problem: Cloud Run instances restart, leaving orphaned in-progress records
  • Solution: Next instance startup cleans up previous instance's stale tasks

Database Consistency

  • Problem: Stuck "in_progress" records prevent new syncs and cause confusion
  • Solution: Clean slate on every startup ensures consistent state

Monitoring & Alerting

  • Benefit: Clear distinction between active tasks and failed tasks
  • Metrics: Startup cleanup stats help identify infrastructure issues

Safety Features

Non-Blocking Startup

  • Cleanup failures don't prevent app startup
  • Errors are logged but application continues normally
  • Essential for production reliability

Conservative Thresholds

  • Default 30-minute age threshold prevents premature cleanup
  • Configurable thresholds allow environment-specific tuning
  • Multiple detection methods reduce false positives

Comprehensive Audit Trail

  • All cleanup actions are fully logged
  • Original worker information preserved
  • Detailed timing and reason metadata
  • Enables post-mortem analysis

Troubleshooting

High Cleanup Counts

If startup cleanup consistently finds many stale records:

  • Check: Worker instance stability
  • Consider: Reducing task complexity or increasing timeouts
  • Monitor: Worker memory and CPU usage

Cleanup Failures

If startup cleanup fails:

  • Check: Database connectivity
  • Verify: Firestore permissions
  • Monitor: Application startup logs

False Positives

If healthy tasks are being cleaned up:

  • Increase: startup_cleanup_stale_threshold_minutes
  • Check: Task heartbeat frequency
  • Verify: System clock synchronization

Related Features

  • Runtime Cleanup: Scheduled cleanup during normal operation
  • API Endpoints: Manual cleanup via /api/v1/sync-progress/cleanup
  • Standalone Script: src/bin/cleanup_stale_syncs.py for cron jobs
  • Internal Endpoint: /internal/email/cleanup-stale-syncs for Cloud Scheduler

Robust SyncProgress Handling (June 2025)

The transaction sync system now mirrors the email system's robust SyncProgress handling:

  • SyncProgress is updated at every stage (pending, in_progress, completed, failed), with detailed context and error information.
  • Failed syncs are automatically reset to 'pending' for retry, ensuring reliability and idempotency.
  • All orchestrator and worker code now uses the same update pattern as the email system, with structured logging and error handling.
  • Tests have been updated to assert on sync progress updates, ensuring regression safety and business-driven coverage.
  • This closes the gap with the email system for progress tracking, error recovery, and user experience.
  • All relevant code and tests now pass, confirming the system is robust and production-ready.