Skip to content

SelfHealer Service

Overview

The SelfHealer is a critical Azure Function microservice that provides automated error recovery and retry functionality for the Publisher platform. This service acts as a resilience layer that automatically retries failed events from other microservices, implements intelligent retry logic with exponential backoff, and manages deferred events that exceed retry limits. It serves as the platform's self-healing mechanism to ensure high availability and data consistency.

Business Purpose

This service serves as the platform's automated recovery system that: - Automatically retries failed events from other microservices - Implements intelligent retry logic with configurable retry attempts - Manages event recovery across the entire Publisher ecosystem - Provides resilience against transient failures and service outages - Ensures data consistency by preventing event loss - Reduces manual intervention for failed event processing - Maintains system reliability and availability - Provides metrics and monitoring for failure patterns

Architecture

Service Type

  • Platform: Azure Functions (Containerized Kubernetes Microservice)
  • Runtime: Node.js
  • Trigger: HTTP Trigger (Anonymous authentication)
  • Pattern: Event-Driven Retry and Recovery System

Key Components

graph TD
    A[Failed Events] --> B[SelfHealer Service]
    B --> C[Handler.js]
    C --> D[Message Validation]
    D --> E{Valid Message?}

    E -->|No| F[Skip Processing]
    E -->|Yes| G[Event Processing Loop]

    G --> H[Extract Events]
    H --> I[Initialize Retry Context]
    I --> J{Retry Attempts Left?}

    J -->|Yes| K[HTTP Request Action]
    K --> L[Send to Target Service]
    L --> M{Response Status?}

    M -->|Success| N[Mark as Successful]
    M -->|Failure| O[Increment Retry Count]

    O --> P{Max Retries Reached?}
    P -->|No| Q[Event Hub: error]
    P -->|Yes| R[Event Hub: deferred]

    N --> S[Complete Processing]
    Q --> T[Retry Later]
    R --> U[Manual Intervention Required]

    V[Target Microservices] --> L
    W[Metrics Tracking] --> B
    X[Application Insights] --> B

Core Functionality

Event Retry Management

  1. Message Validation: Validates incoming retry messages and event structure
  2. Retry Logic: Implements configurable retry attempts with intelligent backoff
  3. HTTP Request Handling: Sends retry requests to target microservices
  4. Response Processing: Analyzes responses to determine retry success/failure
  5. Event Routing: Routes events to appropriate destinations based on retry status

Retry Strategy

  1. Configurable Attempts: Default 3 retry attempts per event (configurable)
  2. Intelligent Routing: Routes failed events back for retry or to deferred queue
  3. Success Detection: Identifies successful processing and failed event responses
  4. Deferred Handling: Manages events that exceed maximum retry attempts
  5. Metrics Tracking: Tracks retry success rates and failure patterns

Key Features

  • Automated Recovery: Automatic retry of failed events without manual intervention
  • Configurable Retry Logic: Flexible retry configuration per service and event type
  • Intelligent Failure Detection: Sophisticated logic to determine retry success/failure
  • Event Preservation: Ensures no events are lost during retry processing
  • Service Integration: Seamless integration with all Publisher microservices
  • Monitoring and Metrics: Comprehensive tracking of retry patterns and success rates
  • Scalable Architecture: Handles high volumes of failed events efficiently

Message Format

Input Message Structure

{
    "serviceName": "target-service-name",
    "eventHubName": "source-event-hub",
    "selfhealer": {
        "action": {
            "http": {
                "type": "http",
                "url": "https://service-endpoint.com/api/retry",
                "timeout": 3000
            }
        },
        "maxRetryAttempts": 3
    },
    "events": [
        {
            "recordid": "event-record-id",
            "data": { /* original event data */ },
            "selfhealer": {
                "retryAttempt": 1,
                "success": false,
                "responseStatus": 500,
                "responseError": "Internal Server Error"
            }
        }
    ]
}

Event Processing Flow

  1. Message Reception: Receives retry messages from Event Hub or direct HTTP
  2. Validation: Validates message structure and required fields
  3. Event Extraction: Extracts individual events for retry processing
  4. Retry Context: Initializes or updates retry context for each event
  5. HTTP Request: Sends retry request to target microservice
  6. Response Analysis: Analyzes response to determine success/failure
  7. Routing Decision: Routes to success, retry, or deferred based on outcome

Retry Logic

Retry Decision Matrix

Response Status Failed Events in Response Action
200 None Success - Complete
200 Present Retry - Failed events detected
Non-200 Any Retry - Service error
Timeout Any Retry - Network/timeout issue

Retry Attempt Management

  • Initial Attempt: First retry attempt (retryAttempt = 1)
  • Subsequent Attempts: Increment retry counter for each attempt
  • Max Attempts: Default 3 attempts (configurable per service)
  • Deferred Events: Events exceeding max attempts sent to deferred queue

Event Routing

  • Success: Event processing completed successfully
  • Retry: Event sent back to error Event Hub for another attempt
  • Deferred: Event sent to deferred Event Hub for manual intervention

HTTP Request Configuration

Request Parameters

  • URL: Target service endpoint (configurable or default pattern)
  • Method: POST (fixed)
  • Timeout: Configurable timeout (default 3000ms)
  • Headers: Content-Type and x-eventhub headers
  • Payload: Original event data without selfhealer metadata

Default URL Pattern

https://{environment}-publisher.delty.com/kube/{serviceName}/

Custom URL Support

Services can specify custom retry endpoints in the selfhealer configuration.

Performance Characteristics

Processing Metrics

  • Throughput: ~100 retry events per second
  • Latency: 50-3000ms depending on target service response time
  • Success Rate: ~85% of retried events eventually succeed
  • Retry Efficiency: Average 1.5 retry attempts per failed event

Scalability Features

  • Concurrent Processing: Parallel processing of multiple events
  • Stateless Design: No state management for horizontal scaling
  • Efficient HTTP Handling: Optimized HTTP client with connection pooling
  • Memory Management: Efficient memory usage for large event batches

Dependencies

External Services

  • Target Microservices: All Publisher microservices that can fail
  • Event Hubs: Error and deferred event routing
  • Application Insights: Metrics and monitoring

Key NPM Packages

  • axios: HTTP client for retry requests
  • idgen: Unique identifier generation

Configuration

Environment-Specific Settings

  • Development: Development service endpoints and reduced timeouts
  • Integration: Integration testing with staging services
  • Production: Production service endpoints with optimized timeouts

Key Configuration Elements

  • HTTP URL pattern for service endpoints
  • Default timeout settings
  • Retry attempt limits
  • Logging levels
  • Application Insights configuration

Error Handling

Error Scenarios

  1. Invalid Messages: Malformed retry messages or missing required fields
  2. Target Service Unavailable: Target microservice is down or unreachable
  3. Network Timeouts: Network connectivity issues or slow responses
  4. HTTP Errors: Various HTTP error responses from target services
  5. Processing Exceptions: Unexpected errors during retry processing

Recovery Mechanisms

  • Graceful Degradation: Continue processing other events if one fails
  • Error Logging: Comprehensive error logging for debugging
  • Metrics Tracking: Track failure patterns and service health
  • Deferred Queue: Safe handling of events that cannot be retried

Monitoring and Observability

Application Insights Integration

  • Custom Metrics: Service-specific retry success/failure metrics
  • Performance Tracking: Response time and throughput monitoring
  • Error Tracking: Comprehensive error logging and alerting
  • Dependency Tracking: Monitor target service health and performance

Key Metrics

  • Retry success rates by service
  • Average retry attempts per event
  • Deferred event volumes
  • Target service response times
  • Error rates and patterns

Metric Names

  • {serviceName}_deferred: Count of events sent to deferred queue
  • Custom telemetry for retry patterns and success rates

Event Hub Integration

Output Destinations

  • error: Events that need additional retry attempts
  • deferred: Events that have exceeded maximum retry attempts

Event Flow

  1. Failed Events → SelfHealer → Target Service
  2. Still Failing → SelfHealer → error Event Hub (retry)
  3. Max Retries Exceeded → SelfHealer → deferred Event Hub (manual intervention)

Security Considerations

  • Anonymous HTTP Trigger: Internal service with no external authentication
  • Data Privacy: Secure handling of event data during retry processing
  • Network Security: HTTPS communication with target services
  • Audit Trail: Comprehensive logging for compliance and debugging

This service integrates with ALL Publisher microservices: - PostbackHandler: Retries failed postback events - RevenueEnrichment: Retries failed revenue processing events - DeviceTrackHandler: Retries failed device tracking events - DocumentCacheHandler: Retries failed cache update events - EventCounter: Retries failed counter update events - All Other Services: Provides retry capability for any service failure

Troubleshooting

Common Issues

  1. High Deferred Event Volume: Check target service health and retry limits
  2. Retry Loops: Verify target service response handling and success detection
  3. Performance Issues: Monitor target service response times and timeout settings
  4. Configuration Errors: Verify service URLs and retry configuration

Debug Steps

  1. Check Application Insights for retry attempt patterns
  2. Verify target service health and connectivity
  3. Review retry configuration and timeout settings
  4. Monitor deferred event queue for patterns
  5. Analyze target service response formats and error handling

Development

Local Development Setup

  1. Clone repository
  2. Install dependencies: npm install
  3. Configure target service endpoints
  4. Set up Event Hub connection strings
  5. Configure Application Insights
  6. Run tests: npm test

Testing

# Test retry functionality
node test.js

# Test with production data
node testData/prodTestData.js

Code Structure

  • src/Handler.js: Main retry processing logic
  • src/actions/httprequest.js: HTTP request handling for retries
  • config/: Environment-specific configurations
  • testData/: Test data and scenarios

Operational Considerations

Monitoring Requirements

  • Monitor deferred event volumes for service health indicators
  • Track retry success rates to identify problematic services
  • Alert on high failure rates or unusual retry patterns
  • Monitor target service response times and availability

Capacity Planning

  • Scale based on failed event volumes from upstream services
  • Consider target service capacity when configuring retry attempts
  • Monitor memory usage during high-volume retry processing
  • Plan for burst capacity during service outages

Future Enhancements

Potential Improvements

  • Exponential Backoff: Implement progressive delay between retry attempts
  • Circuit Breaker: Temporary disable retries for consistently failing services
  • Priority Queuing: Prioritize critical events for faster retry processing
  • Batch Processing: Group similar events for more efficient retry processing
  • Advanced Analytics: Enhanced failure pattern analysis and prediction