Startup Cleanup for Stale Sync Progress Records¶
Overview¶
On startup, the application runs a cleanup task that finds sync progress records stuck in "in_progress" status (because of killed tasks, instance restarts, or other failures) and marks them as failed.
How It Works¶
Startup Process¶
- App Starts: FastAPI lifespan context manager executes
- Firestore Watchers: Initialize database watchers
- Startup Cleanup: Run stale sync cleanup (if enabled)
- Local Environment Setup: Configure local databases (if applicable)
- App Ready: Application is ready to serve requests
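The startup order above can be sketched as an async context manager of the kind FastAPI accepts as its lifespan argument. This is a minimal sketch, not the application's actual code: the step functions (`start_firestore_watchers`, `run_startup_cleanup`, `setup_local_environment`), the `cleanup_enabled` and `is_local` flags, and the `startup_log` list are hypothetical placeholders used only to make the ordering visible.

```python
import asyncio
from contextlib import asynccontextmanager

# Hypothetical record of executed steps, kept only so the order can be inspected.
startup_log: list[str] = []


async def start_firestore_watchers() -> None:
    startup_log.append("firestore_watchers")


async def run_startup_cleanup() -> None:
    startup_log.append("startup_cleanup")


async def setup_local_environment() -> None:
    startup_log.append("local_environment")


def make_lifespan(cleanup_enabled: bool = True, is_local: bool = False):
    """Build a lifespan context manager that runs the steps in the documented order."""

    @asynccontextmanager
    async def lifespan(app):
        await start_firestore_watchers()
        if cleanup_enabled:
            await run_startup_cleanup()
        if is_local:
            await setup_local_environment()
        startup_log.append("app_ready")
        yield  # the application serves requests while suspended here
        startup_log.append("shutdown")

    return lifespan


def demo() -> list[str]:
    """Drive the lifespan once without a web server and return the recorded steps."""

    async def _run() -> None:
        async with make_lifespan(is_local=True)(app=None):
            pass

    startup_log.clear()
    asyncio.run(_run())
    return list(startup_log)
```

With FastAPI, a context manager like this would be passed as `FastAPI(lifespan=make_lifespan())`.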
Detection Logic¶
The startup cleanup uses three detection methods:
- Age-Based Detection (default: 30 minutes)
  - Tasks running longer than the threshold are considered stale
  - Configurable via startup_cleanup_stale_threshold_minutes
- Heartbeat-Based Detection (default: 10 minutes on startup)
  - Tasks with no last_activity_at updates beyond the threshold
  - More aggressive than runtime cleanup (10 min vs 15 min)
  - Configurable via startup_cleanup_heartbeat_threshold_minutes
- Process-Based Detection
  - Tasks where the worker PID no longer exists on the same hostname
  - Only applies to tasks running on the same server
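The three detection methods can be sketched as a single function that returns the reason a record looks stale. This is an illustration under assumptions, not the project's implementation: the record fields `started_at`, `worker_hostname`, and `worker_pid`, the reason strings other than `stale_by_heartbeat`, and the order in which the checks run are all hypothetical. The PID check relies on POSIX signal semantics and is injectable so it can be replaced on other platforms or in tests.

```python
import os
import socket
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


def pid_exists(pid: int) -> bool:
    """POSIX-only check: signal 0 tests for existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stale_reason(
    record: dict,
    now: datetime,
    stale_threshold_minutes: int = 30,
    heartbeat_threshold_minutes: int = 10,
    hostname: Optional[str] = None,
    pid_check: Callable[[int], bool] = pid_exists,
) -> Optional[str]:
    """Return why an in-progress record looks stale, or None if it looks healthy."""
    if record.get("status") != "in_progress":
        return None

    # Heartbeat-based: no last_activity_at update within the threshold.
    last_activity = record.get("last_activity_at")
    heartbeat_limit = timedelta(minutes=heartbeat_threshold_minutes)
    if last_activity is not None and now - last_activity > heartbeat_limit:
        return "stale_by_heartbeat"

    # Age-based: the task has been running longer than the threshold.
    if now - record["started_at"] > timedelta(minutes=stale_threshold_minutes):
        return "stale_by_age"  # hypothetical reason string

    # Process-based: only meaningful when the task ran on this host.
    hostname = hostname or socket.gethostname()
    pid = record.get("worker_pid")
    if record.get("worker_hostname") == hostname and pid is not None:
        if not pid_check(pid):
            return "stale_by_dead_process"  # hypothetical reason string

    return None
```

For example, a record started 20 minutes ago whose last activity was 12 minutes ago would be reported as `stale_by_heartbeat` under the default startup thresholds, even though it is still under the 30-minute age limit.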
Cleanup Actions¶
When stale syncs are detected:
- Status changed from "in_progress" → "failed"
- Detailed error message explaining the cleanup reason
- Comprehensive metadata added to details field:
{
  "cleanup_reason": "stale_by_heartbeat",
  "cleanup_time": "2024-01-01T10:20:00Z",
  "original_worker_hostname": "worker-pod-abc",
  "original_worker_pid": 1234,
  "task_age_minutes": 45.2,
  "last_activity_age_minutes": 25.0
}
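To show how metadata of this shape could be produced, here is a small sketch that builds the details payload and returns an updated copy of a record. The input field names (`started_at`, `last_activity_at`, `worker_hostname`, `worker_pid`), the `error_message` key, and the wording of the message are assumptions for illustration; only the keys inside the details payload come from the example above.

```python
from datetime import datetime, timezone


def build_cleanup_details(record: dict, reason: str, now: datetime) -> dict:
    """Assemble the metadata stored in the details field of a cleaned-up record."""

    def minutes_since(moment: datetime) -> float:
        return round((now - moment).total_seconds() / 60, 1)

    return {
        "cleanup_reason": reason,
        # Assumes `now` is a UTC timestamp, hence the literal "Z" suffix.
        "cleanup_time": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "original_worker_hostname": record["worker_hostname"],
        "original_worker_pid": record["worker_pid"],
        "task_age_minutes": minutes_since(record["started_at"]),
        "last_activity_age_minutes": minutes_since(record["last_activity_at"]),
    }


def mark_failed(record: dict, reason: str, now: datetime) -> dict:
    """Return a copy of the record with status, error message, and details updated."""
    details = build_cleanup_details(record, reason, now)
    return {
        **record,
        "status": "failed",
        "error_message": f"Marked as failed by startup cleanup ({reason})",
        "details": {**record.get("details", {}), **details},
    }
```

Returning a copy rather than mutating the input keeps the original record available for logging, which fits the audit-trail goal described later in this document.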
Configuration¶
Environment Variables¶
# Enable/disable startup cleanup (default: true)
STARTUP_CLEANUP_ENABLED=true
# Age threshold for stale tasks in minutes (default: 30)
STARTUP_CLEANUP_STALE_THRESHOLD_MINUTES=30
# Heartbeat threshold in minutes (default: 10, more aggressive than runtime)
STARTUP_CLEANUP_HEARTBEAT_THRESHOLD_MINUTES=10
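As an illustration of how these variables might be read, the sketch below parses them into a small settings object with the documented defaults. The `StartupCleanupSettings` class, the `load_settings` function, and the set of accepted truthy spellings are hypothetical; the application's real configuration loader may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StartupCleanupSettings:
    enabled: bool = True
    stale_threshold_minutes: int = 30
    heartbeat_threshold_minutes: int = 10


def load_settings(env: Mapping[str, str] = os.environ) -> StartupCleanupSettings:
    """Read the three variables, falling back to the documented defaults."""
    enabled_raw = env.get("STARTUP_CLEANUP_ENABLED", "true")
    return StartupCleanupSettings(
        # Accepted truthy spellings are an assumption of this sketch.
        enabled=enabled_raw.strip().lower() in {"1", "true", "yes", "on"},
        stale_threshold_minutes=int(
            env.get("STARTUP_CLEANUP_STALE_THRESHOLD_MINUTES", "30")
        ),
        heartbeat_threshold_minutes=int(
            env.get("STARTUP_CLEANUP_HEARTBEAT_THRESHOLD_MINUTES", "10")
        ),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to exercise in tests without touching the process environment.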
Configuration Object¶
from application.config import app_config

# Check if startup cleanup is enabled
if app_config.startup_cleanup_enabled:
    print(
        f"Cleanup thresholds: {app_config.startup_cleanup_stale_threshold_minutes}m age, "
        f"{app_config.startup_cleanup_heartbeat_threshold_minutes}m heartbeat"
    )
Logging¶
Successful Cleanup¶
INFO: Running startup cleanup of stale sync progress records...
INFO: Startup cleanup completed: cleaned 3 stale sync records
No Stale Records¶
INFO: Running startup cleanup of stale sync progress records...
INFO: Startup cleanup completed: no stale sync records found
Disabled Cleanup¶
INFO: Startup cleanup disabled by configuration
Failed Cleanup (Non-Fatal)¶
INFO: Running startup cleanup of stale sync progress records...
ERROR: Startup cleanup failed (continuing anyway): Database connection timeout
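The four log scenarios above map onto a simple wrapper that logs each outcome and never lets a cleanup error escape. This is a sketch under assumptions: the function name `run_startup_cleanup`, its parameters, and the logger name are hypothetical, and the `cleanup` callable stands in for whatever routine actually scans and updates the records.

```python
import logging
from typing import Callable

logger = logging.getLogger("startup_cleanup")


def run_startup_cleanup(enabled: bool, cleanup: Callable[[], int]) -> int:
    """Run cleanup at startup, logging the outcome and never raising."""
    if not enabled:
        logger.info("Startup cleanup disabled by configuration")
        return 0

    logger.info("Running startup cleanup of stale sync progress records...")
    try:
        cleaned = cleanup()
    except Exception as exc:  # deliberately broad: startup must continue
        logger.error("Startup cleanup failed (continuing anyway): %s", exc)
        return 0

    if cleaned:
        logger.info(
            "Startup cleanup completed: cleaned %d stale sync records", cleaned
        )
    else:
        logger.info("Startup cleanup completed: no stale sync records found")
    return cleaned
```

Catching the broad `Exception` is usually discouraged, but here it is the point: a database timeout during cleanup should be logged and should not stop the application from starting.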
Operational Benefits¶
Deployment Recovery¶
- Problem: After deployments, some tasks may be killed mid-execution
- Solution: Startup cleanup marks them as failed, allowing retries
Instance Restart Recovery¶
- Problem: Cloud Run instances restart, leaving orphaned in-progress records
- Solution: Next instance startup cleans up previous instance's stale tasks
Database Consistency¶
- Problem: Stuck "in_progress" records prevent new syncs and cause confusion
- Solution: Marking stale records as failed on every startup keeps sync state consistent
Monitoring & Alerting¶
- Benefit: Clear distinction between active tasks and failed tasks
- Metrics: Startup cleanup stats help identify infrastructure issues
Safety Features¶
Non-Blocking Startup¶
- Cleanup failures don't prevent app startup
- Errors are logged but application continues normally
- Essential for production reliability
Conservative Thresholds¶
- Default 30-minute age threshold prevents premature cleanup
- Configurable thresholds allow environment-specific tuning
- Multiple detection methods reduce false positives
Comprehensive Audit Trail¶
- All cleanup actions are fully logged
- Original worker information preserved
- Detailed timing and reason metadata
- Enables post-mortem analysis
Troubleshooting¶
High Cleanup Counts¶
If startup cleanup consistently finds many stale records:
- Check: Worker instance stability
- Consider: Reducing task complexity or increasing timeouts
- Monitor: Worker memory and CPU usage
Cleanup Failures¶
If startup cleanup fails:
- Check: Database connectivity
- Verify: Firestore permissions
- Monitor: Application startup logs
False Positives¶
If healthy tasks are being cleaned up:
- Increase: startup_cleanup_stale_threshold_minutes
- Check: Task heartbeat frequency
- Verify: System clock synchronization
Related Components¶
- Runtime Cleanup: Scheduled cleanup during normal operation
- API Endpoints: Manual cleanup via /api/v1/sync-progress/cleanup
- Standalone Script: src/bin/cleanup_stale_syncs.py for cron jobs
- Internal Endpoint: /internal/email/cleanup-stale-syncs for Cloud Scheduler
Robust SyncProgress Handling (June 2025)¶
The transaction sync system now mirrors the email system's robust SyncProgress handling:
- SyncProgress is updated at every stage (pending, in_progress, completed, failed), with detailed context and error information.
- Failed syncs are automatically reset to 'pending' for retry, ensuring reliability and idempotency.
- All orchestrator and worker code now uses the same update pattern as the email system, with structured logging and error handling.
- Tests have been updated to assert on sync progress updates, ensuring regression safety and business-driven coverage.
- This closes the gap with the email system for progress tracking, error recovery, and user experience.
- All relevant code and tests now pass, confirming the system is robust and production-ready.
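The status lifecycle described here can be pictured as a small transition table. The sketch below is an assumption-laden illustration: the four status names and the failed-to-pending reset come from the text above, while the remaining transitions, the `ALLOWED_TRANSITIONS` table, and the `transition` helper are hypothetical and may not match how the orchestrator enforces state changes.

```python
# Hypothetical transition table; only "failed" -> "pending" is stated in the text.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "pending": {"in_progress"},
    "in_progress": {"completed", "failed"},
    "failed": {"pending"},  # failed syncs are reset for retry
    "completed": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new status if the move is allowed, otherwise raise ValueError."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal SyncProgress transition: {current} -> {new}")
    return new


def retry_after_failure(status: str) -> str:
    """Reset a failed sync to pending; leave any other status unchanged."""
    return transition(status, "pending") if status == "failed" else status
```

A record that startup cleanup has marked as failed would therefore be eligible for the same reset as any other failed sync.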