SelfHealer Service
Overview
The SelfHealer is an Azure Function microservice that provides automated error recovery and retry functionality for the Publisher platform. It acts as a resilience layer that automatically retries failed events from other microservices, applies configurable retry logic, and routes events that exceed their retry limit to a deferred queue. It is the platform's self-healing mechanism, helping ensure high availability and data consistency.
Business Purpose
This service serves as the platform's automated recovery system that:
- Automatically retries failed events from other microservices
- Implements intelligent retry logic with configurable retry attempts
- Manages event recovery across the entire Publisher ecosystem
- Provides resilience against transient failures and service outages
- Ensures data consistency by preventing event loss
- Reduces manual intervention for failed event processing
- Maintains system reliability and availability
- Provides metrics and monitoring for failure patterns
Architecture
Service Type
- Platform: Azure Functions (Containerized Kubernetes Microservice)
- Runtime: Node.js
- Trigger: HTTP Trigger (Anonymous authentication)
- Pattern: Event-Driven Retry and Recovery System
Key Components
```mermaid
graph TD
    A[Failed Events] --> B[SelfHealer Service]
    B --> C[Handler.js]
    C --> D[Message Validation]
    D --> E{Valid Message?}
    E -->|No| F[Skip Processing]
    E -->|Yes| G[Event Processing Loop]
    G --> H[Extract Events]
    H --> I[Initialize Retry Context]
    I --> J{Retry Attempts Left?}
    J -->|Yes| K[HTTP Request Action]
    K --> L[Send to Target Service]
    L --> M{Response Status?}
    M -->|Success| N[Mark as Successful]
    M -->|Failure| O[Increment Retry Count]
    O --> P{Max Retries Reached?}
    P -->|No| Q[Event Hub: error]
    P -->|Yes| R[Event Hub: deferred]
    N --> S[Complete Processing]
    Q --> T[Retry Later]
    R --> U[Manual Intervention Required]
    V[Target Microservices] --> L
    W[Metrics Tracking] --> B
    X[Application Insights] --> B
```
Core Functionality
Event Retry Management
- Message Validation: Validates incoming retry messages and event structure
- Retry Logic: Implements a configurable number of retry attempts per event
- HTTP Request Handling: Sends retry requests to target microservices
- Response Processing: Analyzes responses to determine retry success/failure
- Event Routing: Routes events to appropriate destinations based on retry status
Retry Strategy
- Configurable Attempts: Default 3 retry attempts per event (configurable)
- Intelligent Routing: Routes failed events back for retry or to deferred queue
- Success Detection: Identifies successful processing and failed event responses
- Deferred Handling: Manages events that exceed maximum retry attempts
- Metrics Tracking: Tracks retry success rates and failure patterns
Key Features
- Automated Recovery: Automatic retry of failed events without manual intervention
- Configurable Retry Logic: Flexible retry configuration per service and event type
- Intelligent Failure Detection: Sophisticated logic to determine retry success/failure
- Event Preservation: Ensures no events are lost during retry processing
- Service Integration: Seamless integration with all Publisher microservices
- Monitoring and Metrics: Comprehensive tracking of retry patterns and success rates
- Scalable Architecture: Handles high volumes of failed events efficiently
Message Format
Input Message Structure
```json
{
  "serviceName": "target-service-name",
  "eventHubName": "source-event-hub",
  "selfhealer": {
    "action": {
      "http": {
        "type": "http",
        "url": "https://service-endpoint.com/api/retry",
        "timeout": 3000
      }
    },
    "maxRetryAttempts": 3
  },
  "events": [
    {
      "recordid": "event-record-id",
      "data": { /* original event data */ },
      "selfhealer": {
        "retryAttempt": 1,
        "success": false,
        "responseStatus": 500,
        "responseError": "Internal Server Error"
      }
    }
  ]
}
```
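A minimal validation sketch for this message shape (field names follow the example above; the helper name `isValidRetryMessage` and the exact checks are illustrative, not the actual Handler.js code):

```javascript
// Validates the retry message shape shown above. The real Handler.js
// validation may check additional fields; this follows the documented shape.
function isValidRetryMessage(message) {
  if (!message || typeof message !== "object") return false;
  if (typeof message.serviceName !== "string" || !message.serviceName) return false;
  if (!Array.isArray(message.events) || message.events.length === 0) return false;
  const action = message.selfhealer && message.selfhealer.action;
  const http = action && action.http;
  // A usable HTTP action needs at least a URL; timeout falls back to a default.
  if (!http || typeof http.url !== "string") return false;
  // Every event needs an id and its original payload.
  return message.events.every((e) => e.recordid && e.data !== undefined);
}
```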
Event Processing Flow
- Message Reception: Receives retry messages from Event Hub or direct HTTP
- Validation: Validates message structure and required fields
- Event Extraction: Extracts individual events for retry processing
- Retry Context: Initializes or updates retry context for each event
- HTTP Request: Sends retry request to target microservice
- Response Analysis: Analyzes response to determine success/failure
- Routing Decision: Routes to success, retry, or deferred based on outcome
Retry Logic
Retry Decision Matrix
| Response Status | Failed Events in Response | Action |
|---|---|---|
| 200 | None | Success - Complete |
| 200 | Present | Retry - Failed events detected |
| Non-200 | Any | Retry - Service error |
| Timeout | Any | Retry - Network/timeout issue |
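The decision matrix above can be expressed as a small predicate (a sketch; the response shape, in particular the `failedEvents` field, is assumed for illustration and may differ from the service's actual response parsing):

```javascript
// Maps the retry decision matrix to code:
// 200 with no failed events => success; anything else => retry.
function shouldRetry(response) {
  if (!response) return true;                 // timeout / network error: retry
  if (response.status !== 200) return true;   // service error: retry
  const failed = response.failedEvents || []; // field name assumed for illustration
  return failed.length > 0;                   // 200 but some events failed
}
```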
Retry Attempt Management
- Initial Attempt: First retry attempt (retryAttempt = 1)
- Subsequent Attempts: Increment retry counter for each attempt
- Max Attempts: Default 3 attempts (configurable per service)
- Deferred Events: Events exceeding max attempts sent to deferred queue
Event Routing
- Success: Event processing completed successfully
- Retry: Event sent back to error Event Hub for another attempt
- Deferred: Event sent to deferred Event Hub for manual intervention
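The retry-or-defer choice can be sketched as follows (the counter arithmetic is an assumption based on the documented `retryAttempt` field and default of 3 attempts; the service's exact bookkeeping may differ):

```javascript
// After a failed attempt, route the event: back to the "error" hub for
// another try, or to "deferred" once the attempt limit is exhausted.
// Hub names come from the Output Destinations section.
function routeFailedEvent(event, maxRetryAttempts) {
  const attempts = (event.selfhealer && event.selfhealer.retryAttempt) || 0;
  return attempts >= maxRetryAttempts ? "deferred" : "error";
}
```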
HTTP Request Configuration
Request Parameters
- URL: Target service endpoint (configurable or default pattern)
- Method: POST (fixed)
- Timeout: Configurable timeout (default 3000ms)
- Headers: Content-Type and x-eventhub headers
- Payload: Original event data without selfhealer metadata
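A sketch of a request built from the parameters above. The service itself uses axios; Node's built-in fetch is used here only to keep the sketch dependency-free, and the helper names are illustrative:

```javascript
// Strips the selfhealer metadata so the target service receives only the
// original event payload, as described under Request Parameters.
function buildRetryPayload(event) {
  const { selfhealer, ...payload } = event;
  return payload;
}

// Sends one retry request: POST (fixed), Content-Type and x-eventhub
// headers, configurable timeout defaulting to 3000ms.
async function sendRetry(url, eventHubName, event, timeoutMs = 3000) {
  return fetch(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-eventhub": eventHubName,
    },
    body: JSON.stringify(buildRetryPayload(event)),
    signal: AbortSignal.timeout(timeoutMs), // aborts slow requests
  });
}
```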
Default URL Pattern
https://{environment}-publisher.delty.com/kube/{serviceName}/
Custom URL Support
Services can specify custom retry endpoints in the selfhealer configuration.
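Endpoint resolution might look like this (a sketch assuming the config shape from the Input Message Structure; the default pattern is taken verbatim from above):

```javascript
// Resolves the retry endpoint: a custom URL in the selfhealer config wins;
// otherwise fall back to the documented default URL pattern.
function resolveRetryUrl(message, environment) {
  const http =
    message.selfhealer &&
    message.selfhealer.action &&
    message.selfhealer.action.http;
  if (http && http.url) return http.url;
  return `https://${environment}-publisher.delty.com/kube/${message.serviceName}/`;
}
```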
Performance Characteristics
Processing Metrics
- Throughput: ~100 retry events per second
- Latency: 50-3000ms depending on target service response time
- Success Rate: ~85% of retried events eventually succeed
- Retry Efficiency: Average 1.5 retry attempts per failed event
Scalability Features
- Concurrent Processing: Parallel processing of multiple events
- Stateless Design: No state management for horizontal scaling
- Efficient HTTP Handling: Optimized HTTP client with connection pooling
- Memory Management: Efficient memory usage for large event batches
Dependencies
External Services
- Target Microservices: All Publisher microservices that can fail
- Event Hubs: Error and deferred event routing
- Application Insights: Metrics and monitoring
Key NPM Packages
- axios: HTTP client for retry requests
- idgen: Unique identifier generation
Configuration
Environment-Specific Settings
- Development: Development service endpoints and reduced timeouts
- Integration: Integration testing with staging services
- Production: Production service endpoints with optimized timeouts
Key Configuration Elements
- HTTP URL pattern for service endpoints
- Default timeout settings
- Retry attempt limits
- Logging levels
- Application Insights configuration
Error Handling
Error Scenarios
- Invalid Messages: Malformed retry messages or missing required fields
- Target Service Unavailable: Target microservice is down or unreachable
- Network Timeouts: Network connectivity issues or slow responses
- HTTP Errors: Various HTTP error responses from target services
- Processing Exceptions: Unexpected errors during retry processing
Recovery Mechanisms
- Graceful Degradation: Continue processing other events if one fails
- Error Logging: Comprehensive error logging for debugging
- Metrics Tracking: Track failure patterns and service health
- Deferred Queue: Safe handling of events that cannot be retried
Monitoring and Observability
Application Insights Integration
- Custom Metrics: Service-specific retry success/failure metrics
- Performance Tracking: Response time and throughput monitoring
- Error Tracking: Comprehensive error logging and alerting
- Dependency Tracking: Monitor target service health and performance
Key Metrics
- Retry success rates by service
- Average retry attempts per event
- Deferred event volumes
- Target service response times
- Error rates and patterns
Metric Names
- {serviceName}_deferred: Count of events sent to the deferred queue
- Custom telemetry for retry patterns and success rates
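A sketch of how the per-service deferred counter might be emitted (the helper name is illustrative; the client is injected, so any telemetry object with a `trackMetric` method works, e.g. `appInsights.defaultClient` from the applicationinsights package):

```javascript
// Emits the {serviceName}_deferred counter described above.
// `client` must expose trackMetric({ name, value }), matching the
// Application Insights Node.js SDK's method signature.
function trackDeferred(client, serviceName, count) {
  client.trackMetric({ name: `${serviceName}_deferred`, value: count });
}
```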
Event Hub Integration
Output Destinations
- error: Events that need additional retry attempts
- deferred: Events that have exceeded maximum retry attempts
Event Flow
- Failed Events → SelfHealer → Target Service
- Still Failing → SelfHealer → error Event Hub (retry)
- Max Retries Exceeded → SelfHealer → deferred Event Hub (manual intervention)
Security Considerations
- Anonymous HTTP Trigger: Internal service with no external authentication
- Data Privacy: Secure handling of event data during retry processing
- Network Security: HTTPS communication with target services
- Audit Trail: Comprehensive logging for compliance and debugging
Related Services
This service integrates with ALL Publisher microservices:
- PostbackHandler: Retries failed postback events
- RevenueEnrichment: Retries failed revenue processing events
- DeviceTrackHandler: Retries failed device tracking events
- DocumentCacheHandler: Retries failed cache update events
- EventCounter: Retries failed counter update events
- All Other Services: Provides retry capability for any service failure
Troubleshooting
Common Issues
- High Deferred Event Volume: Check target service health and retry limits
- Retry Loops: Verify target service response handling and success detection
- Performance Issues: Monitor target service response times and timeout settings
- Configuration Errors: Verify service URLs and retry configuration
Debug Steps
- Check Application Insights for retry attempt patterns
- Verify target service health and connectivity
- Review retry configuration and timeout settings
- Monitor deferred event queue for patterns
- Analyze target service response formats and error handling
Development
Local Development Setup
- Clone the repository
- Install dependencies: npm install
- Configure target service endpoints
- Set up Event Hub connection strings
- Configure Application Insights
- Run tests: npm test
Testing
```shell
# Test retry functionality
node test.js

# Test with production data
node testData/prodTestData.js
```
Code Structure
- src/Handler.js: Main retry processing logic
- src/actions/httprequest.js: HTTP request handling for retries
- config/: Environment-specific configurations
- testData/: Test data and scenarios
Operational Considerations
Monitoring Requirements
- Monitor deferred event volumes for service health indicators
- Track retry success rates to identify problematic services
- Alert on high failure rates or unusual retry patterns
- Monitor target service response times and availability
Capacity Planning
- Scale based on failed event volumes from upstream services
- Consider target service capacity when configuring retry attempts
- Monitor memory usage during high-volume retry processing
- Plan for burst capacity during service outages
Future Enhancements
Potential Improvements
- Exponential Backoff: Implement progressive delay between retry attempts
- Circuit Breaker: Temporarily disable retries for consistently failing services
- Priority Queuing: Prioritize critical events for faster retry processing
- Batch Processing: Group similar events for more efficient retry processing
- Advanced Analytics: Enhanced failure pattern analysis and prediction
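If exponential backoff were added, the delay schedule might look like the following sketch (base delay and cap are illustrative values, not part of the service today):

```javascript
// Exponential backoff with a cap: baseMs * 2^(attempt - 1), never
// exceeding maxMs. attempt is 1-based, matching the retryAttempt field.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}
```

With the defaults, attempts 1, 2, 3 wait 1s, 2s, 4s, and the delay plateaus at 30s for late attempts, which keeps pressure off a struggling target service.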