SelfHealer Service
Overview
The SelfHealer is an Azure Function microservice that provides automated error recovery and retry functionality for the Publisher platform. It acts as a resilience layer that automatically retries failed events from other microservices, applies configurable retry logic, and routes events that exceed their retry limit to a deferred queue. It is the platform's self-healing mechanism, helping ensure high availability and data consistency.
Business Purpose
This service serves as the platform's automated recovery system that:
- Automatically retries failed events from other microservices
- Implements intelligent retry logic with configurable retry attempts
- Manages event recovery across the entire Publisher ecosystem
- Provides resilience against transient failures and service outages
- Ensures data consistency by preventing event loss
- Reduces manual intervention for failed event processing
- Maintains system reliability and availability
- Provides metrics and monitoring for failure patterns
Architecture
Service Type
- Platform: Azure Functions (Containerized Kubernetes Microservice)
- Runtime: Node.js
- Trigger: HTTP Trigger (Anonymous authentication)
- Pattern: Event-Driven Retry and Recovery System
Key Components
```mermaid
graph TD
    A[Failed Events] --> B[SelfHealer Service]
    B --> C[Handler.js]
    C --> D[Message Validation]
    D --> E{Valid Message?}
    E -->|No| F[Skip Processing]
    E -->|Yes| G[Event Processing Loop]
    G --> H[Extract Events]
    H --> I[Initialize Retry Context]
    I --> J{Retry Attempts Left?}
    J -->|Yes| K[HTTP Request Action]
    K --> L[Send to Target Service]
    L --> M{Response Status?}
    M -->|Success| N[Mark as Successful]
    M -->|Failure| O[Increment Retry Count]
    O --> P{Max Retries Reached?}
    P -->|No| Q[Event Hub: error]
    P -->|Yes| R[Event Hub: deferred]
    N --> S[Complete Processing]
    Q --> T[Retry Later]
    R --> U[Manual Intervention Required]
    V[Target Microservices] --> L
    W[Metrics Tracking] --> B
    X[Application Insights] --> B
```
Core Functionality
Event Retry Management
- Message Validation: Validates incoming retry messages and event structure
- Retry Logic: Implements a configurable number of retry attempts per event
- HTTP Request Handling: Sends retry requests to target microservices
- Response Processing: Analyzes responses to determine retry success/failure
- Event Routing: Routes events to appropriate destinations based on retry status
Retry Strategy
- Configurable Attempts: Default 3 retry attempts per event (configurable)
- Intelligent Routing: Routes failed events back for retry or to deferred queue
- Success Detection: Identifies successful processing and failed event responses
- Deferred Handling: Manages events that exceed maximum retry attempts
- Metrics Tracking: Tracks retry success rates and failure patterns
Key Features
- Automated Recovery: Automatic retry of failed events without manual intervention
- Configurable Retry Logic: Flexible retry configuration per service and event type
- Intelligent Failure Detection: Sophisticated logic to determine retry success/failure
- Event Preservation: Ensures no events are lost during retry processing
- Service Integration: Seamless integration with all Publisher microservices
- Monitoring and Metrics: Comprehensive tracking of retry patterns and success rates
- Scalable Architecture: Handles high volumes of failed events efficiently
Message Format
Input Message Structure
```json
{
  "serviceName": "target-service-name",
  "eventHubName": "source-event-hub",
  "selfhealer": {
    "action": {
      "http": {
        "type": "http",
        "url": "https://service-endpoint.com/api/retry",
        "timeout": 3000
      }
    },
    "maxRetryAttempts": 3
  },
  "events": [
    {
      "recordid": "event-record-id",
      "data": { /* original event data */ },
      "selfhealer": {
        "retryAttempt": 1,
        "success": false,
        "responseStatus": 500,
        "responseError": "Internal Server Error"
      }
    }
  ]
}
```
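A minimal validation sketch for this message shape (field names follow the example above; the helper name `isValidRetryMessage` and the exact checks are illustrative, not the actual Handler.js code):

```javascript
// Validates the retry message shape shown above. The real Handler.js
// validation may check additional fields; this follows the documented shape.
function isValidRetryMessage(message) {
  if (!message || typeof message !== "object") return false;
  if (typeof message.serviceName !== "string" || !message.serviceName) return false;
  if (!Array.isArray(message.events) || message.events.length === 0) return false;
  const action = message.selfhealer && message.selfhealer.action;
  const http = action && action.http;
  // A usable HTTP action needs at least a URL; timeout falls back to a default.
  if (!http || typeof http.url !== "string") return false;
  // Every event needs an id and its original payload.
  return message.events.every((e) => e.recordid && e.data !== undefined);
}
```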
Event Processing Flow
- Message Reception: Receives retry messages from Event Hub or direct HTTP
- Validation: Validates message structure and required fields
- Event Extraction: Extracts individual events for retry processing
- Retry Context: Initializes or updates retry context for each event
- HTTP Request: Sends retry request to target microservice
- Response Analysis: Analyzes response to determine success/failure
- Routing Decision: Routes to success, retry, or deferred based on outcome
Retry Logic
Retry Decision Matrix
| Response Status | Failed Events in Response | Action |
|---|---|---|
| 200 | None | Success - Complete |
| 200 | Present | Retry - Failed events detected |
| Non-200 | Any | Retry - Service error |
| Timeout | Any | Retry - Network/timeout issue |
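The decision matrix above can be expressed as a small predicate (a sketch; the response shape, in particular the `failedEvents` field, is assumed for illustration and may differ from the service's actual response parsing):

```javascript
// Maps the retry decision matrix to code:
// 200 with no failed events => success; anything else => retry.
function shouldRetry(response) {
  if (!response) return true;                 // timeout / network error: retry
  if (response.status !== 200) return true;   // service error: retry
  const failed = response.failedEvents || []; // field name assumed for illustration
  return failed.length > 0;                   // 200 but some events failed
}
```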
Retry Attempt Management
- Initial Attempt: First retry attempt (retryAttempt = 1)
- Subsequent Attempts: Increment retry counter for each attempt
- Max Attempts: Default 3 attempts (configurable per service)
- Deferred Events: Events exceeding max attempts sent to deferred queue
Event Routing
- Success: Event processing completed successfully
- Retry: Event sent back to error Event Hub for another attempt
- Deferred: Event sent to deferred Event Hub for manual intervention
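The retry-or-defer choice can be sketched as follows (the counter arithmetic is an assumption based on the documented `retryAttempt` field and default of 3 attempts; the service's exact bookkeeping may differ):

```javascript
// After a failed attempt, route the event: back to the "error" hub for
// another try, or to "deferred" once the attempt limit is exhausted.
// Hub names come from the Output Destinations section.
function routeFailedEvent(event, maxRetryAttempts) {
  const attempts = (event.selfhealer && event.selfhealer.retryAttempt) || 0;
  return attempts >= maxRetryAttempts ? "deferred" : "error";
}
```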
HTTP Request Configuration
Request Parameters
- URL: Target service endpoint (configurable or default pattern)
- Method: POST (fixed)
- Timeout: Configurable timeout (default 3000ms)
- Headers: Content-Type and x-eventhub headers
- Payload: Original event data without selfhealer metadata
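A sketch of a request built from the parameters above. The service itself uses axios; Node's built-in fetch is used here only to keep the sketch dependency-free, and the helper names are illustrative:

```javascript
// Strips the selfhealer metadata so the target service receives only the
// original event payload, as described under Request Parameters.
function buildRetryPayload(event) {
  const { selfhealer, ...payload } = event;
  return payload;
}

// Sends one retry request: POST (fixed), Content-Type and x-eventhub
// headers, configurable timeout defaulting to 3000ms.
async function sendRetry(url, eventHubName, event, timeoutMs = 3000) {
  return fetch(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-eventhub": eventHubName,
    },
    body: JSON.stringify(buildRetryPayload(event)),
    signal: AbortSignal.timeout(timeoutMs), // aborts slow requests
  });
}
```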
Default URL Pattern
https://{environment}-publisher.delty.com/kube/{serviceName}/
Custom URL Support
Services can specify custom retry endpoints in the selfhealer configuration.
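Endpoint resolution might look like this (a sketch assuming the config shape from the Input Message Structure; the default pattern is taken verbatim from above):

```javascript
// Resolves the retry endpoint: a custom URL in the selfhealer config wins;
// otherwise fall back to the documented default URL pattern.
function resolveRetryUrl(message, environment) {
  const http =
    message.selfhealer &&
    message.selfhealer.action &&
    message.selfhealer.action.http;
  if (http && http.url) return http.url;
  return `https://${environment}-publisher.delty.com/kube/${message.serviceName}/`;
}
```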
Performance Characteristics
Processing Metrics
- Throughput: ~100 retry events per second
- Latency: 50-3000ms depending on target service response time
- Success Rate: ~85% of retried events eventually succeed
- Retry Efficiency: Average 1.5 retry attempts per failed event
Scalability Features
- Concurrent Processing: Parallel processing of multiple events
- Stateless Design: No state management for horizontal scaling
- Efficient HTTP Handling: Optimized HTTP client with connection pooling
- Memory Management: Efficient memory usage for large event batches
Dependencies
External Services
- Target Microservices: All Publisher microservices that can fail
- Event Hubs: Error and deferred event routing
- Application Insights: Metrics and monitoring
Key NPM Packages
- axios: HTTP client for retry requests
- idgen: Unique identifier generation
Configuration
Environment-Specific Settings
- Development: Development service endpoints and reduced timeouts
- Integration: Integration testing with staging services
- Production: Production service endpoints with optimized timeouts
Key Configuration Elements
- HTTP URL pattern for service endpoints
- Default timeout settings
- Retry attempt limits
- Logging levels
- Application Insights configuration
Error Handling
Error Scenarios
- Invalid Messages: Malformed retry messages or missing required fields
- Target Service Unavailable: Target microservice is down or unreachable
- Network Timeouts: Network connectivity issues or slow responses
- HTTP Errors: Various HTTP error responses from target services
- Processing Exceptions: Unexpected errors during retry processing
Recovery Mechanisms
- Graceful Degradation: Continue processing other events if one fails
- Error Logging: Comprehensive error logging for debugging
- Metrics Tracking: Track failure patterns and service health
- Deferred Queue: Safe handling of events that cannot be retried
Monitoring and Observability
Application Insights Integration
- Custom Metrics: Service-specific retry success/failure metrics
- Performance Tracking: Response time and throughput monitoring
- Error Tracking: Comprehensive error logging and alerting
- Dependency Tracking: Monitor target service health and performance
Key Metrics
- Retry success rates by service
- Average retry attempts per event
- Deferred event volumes
- Target service response times
- Error rates and patterns
Metric Names
- {serviceName}_deferred: Count of events sent to the deferred queue
- Custom telemetry for retry patterns and success rates
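A sketch of how the per-service deferred counter might be emitted (the helper name is illustrative; the client is injected, so any telemetry object with a `trackMetric` method works, e.g. `appInsights.defaultClient` from the applicationinsights package):

```javascript
// Emits the {serviceName}_deferred counter described above.
// `client` must expose trackMetric({ name, value }), matching the
// Application Insights Node.js SDK's method signature.
function trackDeferred(client, serviceName, count) {
  client.trackMetric({ name: `${serviceName}_deferred`, value: count });
}
```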
Event Hub Integration
Output Destinations
- error: Events that need additional retry attempts
- deferred: Events that have exceeded maximum retry attempts
Event Flow
- Failed Events → SelfHealer → Target Service
- Still Failing → SelfHealer → error Event Hub (retry)
- Max Retries Exceeded → SelfHealer → deferred Event Hub (manual intervention)
Security Considerations
- Anonymous HTTP Trigger: Internal service with no external authentication
- Data Privacy: Secure handling of event data during retry processing
- Network Security: HTTPS communication with target services
- Audit Trail: Comprehensive logging for compliance and debugging
Related Services
This service integrates with ALL Publisher microservices:
- PostbackHandler: Retries failed postback events
- RevenueEnrichment: Retries failed revenue processing events
- DeviceTrackHandler: Retries failed device tracking events
- DocumentCacheHandler: Retries failed cache update events
- EventCounter: Retries failed counter update events
- All Other Services: Provides retry capability for any service failure
Troubleshooting
Common Issues
- High Deferred Event Volume: Check target service health and retry limits
- Retry Loops: Verify target service response handling and success detection
- Performance Issues: Monitor target service response times and timeout settings
- Configuration Errors: Verify service URLs and retry configuration
Debug Steps
- Check Application Insights for retry attempt patterns
- Verify target service health and connectivity
- Review retry configuration and timeout settings
- Monitor deferred event queue for patterns
- Analyze target service response formats and error handling
Development
Local Development Setup
- Clone the repository
- Install dependencies: npm install
- Configure target service endpoints
- Set up Event Hub connection strings
- Configure Application Insights
- Run tests: npm test
Testing
```shell
# Test retry functionality
node test.js

# Test with production data
node testData/prodTestData.js
```
Code Structure
- src/Handler.js: Main retry processing logic
- src/actions/httprequest.js: HTTP request handling for retries
- config/: Environment-specific configurations
- testData/: Test data and scenarios
Operational Considerations
Monitoring Requirements
- Monitor deferred event volumes for service health indicators
- Track retry success rates to identify problematic services
- Alert on high failure rates or unusual retry patterns
- Monitor target service response times and availability
Capacity Planning
- Scale based on failed event volumes from upstream services
- Consider target service capacity when configuring retry attempts
- Monitor memory usage during high-volume retry processing
- Plan for burst capacity during service outages
Future Enhancements
Potential Improvements
- Exponential Backoff: Implement progressive delay between retry attempts
- Circuit Breaker: Temporarily disable retries for consistently failing services
- Priority Queuing: Prioritize critical events for faster retry processing
- Batch Processing: Group similar events for more efficient retry processing
- Advanced Analytics: Enhanced failure pattern analysis and prediction
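If exponential backoff were added, the delay schedule might look like the following sketch (base delay and cap are illustrative values, not part of the service today):

```javascript
// Exponential backoff with a cap: baseMs * 2^(attempt - 1), never
// exceeding maxMs. attempt is 1-based, matching the retryAttempt field.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}
```

With the defaults, attempts 1, 2, 3 wait 1s, 2s, 4s, and the delay plateaus at 30s for late attempts, which keeps pressure off a struggling target service.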