BFF Resilience Patterns: Circuit Breakers, Retries & Timeouts with Polly
Making the BFF fault-tolerant using Polly. Handling partial upstream failures gracefully in aggregated responses.

A BFF that aggregates four upstream services inherits four independent failure modes. Any one of them can be unavailable, slow, or intermittently returning errors at any time. The question is not whether an upstream service will fail (it will), but whether that failure propagates to the user as a broken screen or is absorbed by the BFF and handled gracefully.
Polly is the .NET resilience library that provides the building blocks to absorb those failures: retries for transient errors, timeouts for slow upstreams, circuit breakers for services that are systematically down, and bulkheads for isolating one upstream's failure from another's. Used correctly, these patterns make the BFF fault-tolerant. Used incorrectly (retrying too aggressively, timing out too generously, failing to isolate failure domains), they amplify the problems they were meant to solve.
This article covers the correct application of each pattern to a BFF, the specific failure modes each one addresses, and how they compose into a production-grade resilience strategy. Code examples use the education platform BFF built throughout this series and Microsoft.Extensions.Http.Resilience, the .NET 8 integration layer that wires Polly into the HttpClient pipeline.
The resilience problem, stated precisely
In Article 4, every typed HTTP client was configured with AddStandardResilienceHandler():
builder.Services.AddHttpClient<CourseServiceClient>(client =>
client.BaseAddress = new Uri(builder.Configuration["Services:CourseService:BaseUrl"]!))
.AddStandardResilienceHandler();
The standard handler is a reasonable starting point: it wires retry, circuit breaker, and timeout with sensible defaults. But it is a generic solution, and a BFF has specific requirements that the defaults do not address:
Different upstream services have different acceptable latency budgets. A user profile lookup should time out faster than a course session export.
Retrying a user profile call three times is reasonable. Retrying an enrollment mutation three times could create three enrollments.
A circuit breaker that opens for 30 seconds on the notification service should not affect the circuit breaker state of the course service.
The aggregator's partial failure handling (from Article 4) depends on the resilience layer returning a specific failure signal, not throwing an exception that crashes the entire aggregation.
These requirements mean the standard handler needs to be replaced with per-client custom configuration for any BFF that runs in production under real conditions.
Installing the right packages
dotnet add package Microsoft.Extensions.Http.Resilience
dotnet add package Polly
dotnet add package Polly.Extensions
Microsoft.Extensions.Http.Resilience is the preferred integration layer in .NET 8. It uses Polly 8 under the hood and integrates with IHttpClientFactory, ILogger, and IMeterFactory from the host. The raw Polly package is used for building custom strategies; Polly.Extensions adds telemetry and dependency-injection support for resilience pipelines.
Understanding the strategy execution order
Before configuring individual strategies, it is worth understanding the order in which they wrap each request. The standard pipeline executes strategies from outermost to innermost: total timeout, then retry, then circuit breaker, then per-attempt timeout.
The total timeout is the hard wall: no matter how many retries are attempted, the entire operation cannot exceed this duration. The retry wraps the circuit breaker, which means the circuit breaker sees individual attempt outcomes. The attempt timeout applies per attempt: if a single upstream call takes longer than it, the call is cancelled and the retry strategy fires.
This ordering is not arbitrary. Inverting the circuit breaker and the retry would mean the circuit breaker sees each whole retry sequence as a single outcome, which defeats its purpose. Understanding this ordering is a prerequisite for understanding why the custom configuration below is shaped the way it is.
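That order can be sketched directly in Polly 8's builder API: strategies execute in the order they are added, first added outermost. A minimal sketch with placeholder durations (not the production values used later in this article):

```csharp
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

// Strategies run in the order they are added: the first added is outermost.
// A request therefore flows: total timeout -> retry -> circuit breaker -> attempt timeout.
var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddTimeout(TimeSpan.FromSeconds(2))                                          // 1. total timeout (hard wall)
    .AddRetry(new RetryStrategyOptions<HttpResponseMessage>())                    // 2. retry wraps the breaker
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>())  // 3. sees each attempt outcome
    .AddTimeout(TimeSpan.FromMilliseconds(500))                                   // 4. per-attempt timeout (innermost)
    .Build();
```

Reading the chain top to bottom gives the wrapping order: the per-attempt timeout cancels a single slow call, the circuit breaker records that attempt's outcome, the retry decides whether to go again, and the total timeout caps everything.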
Per-client resilience configuration
The correct approach for a BFF is to define a resilience pipeline per upstream client, tuned to that service's characteristics. A helper method keeps the configuration readable:
// Infrastructure/Resilience/ResiliencePipelineFactory.cs
public static class ResiliencePipelineFactory
{
/// <summary>
/// Standard read pipeline - safe to retry, moderate timeout.
/// Use for GET requests to stable internal services.
/// </summary>
public static Action<ResiliencePipelineBuilder<HttpResponseMessage>>
ReadPipeline(
string serviceName,
TimeSpan attemptTimeout,
TimeSpan totalTimeout) => pipeline =>
{
pipeline
// 1. Total timeout - hard limit on the whole operation including retries
.AddTimeout(new TimeoutStrategyOptions
{
Timeout = totalTimeout,
OnTimeout = args =>
{
Log.TotalTimeout(args.Context.GetLogger(), serviceName, totalTimeout);
return ValueTask.CompletedTask;
}
})
// 2. Retry - exponential backoff with jitter, read-safe
.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
{
MaxRetryAttempts = 2,
Delay = TimeSpan.FromMilliseconds(200),
BackoffType = DelayBackoffType.Exponential,
UseJitter = true,
ShouldHandle = args => ValueTask.FromResult(
ShouldRetry(args.Outcome)),
OnRetry = args =>
{
Log.Retrying(args.Context.GetLogger(), serviceName,
args.AttemptNumber + 1, args.RetryDelay);
return ValueTask.CompletedTask;
}
})
// 3. Circuit breaker - opens after sustained failures
.AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
{
FailureRatio = 0.5, // Open when 50% of requests fail
SamplingDuration = TimeSpan.FromSeconds(30),
MinimumThroughput = 5, // Minimum requests before ratio applies
BreakDuration = TimeSpan.FromSeconds(20),
ShouldHandle = args => ValueTask.FromResult(
ShouldHandle(args.Outcome)),
OnOpened = args =>
{
Log.CircuitOpened(args.Context.GetLogger(), serviceName,
args.BreakDuration);
return ValueTask.CompletedTask;
},
OnClosed = args =>
{
Log.CircuitClosed(args.Context.GetLogger(), serviceName);
return ValueTask.CompletedTask;
},
OnHalfOpened = args =>
{
Log.CircuitHalfOpened(args.Context.GetLogger(), serviceName);
return ValueTask.CompletedTask;
}
})
// 4. Per-attempt timeout - cancels a single slow call before retry fires
.AddTimeout(new TimeoutStrategyOptions
{
Timeout = attemptTimeout
});
};
/// <summary>
/// Write pipeline - NOT safe to retry on most failures.
/// Use for POST/PUT/DELETE requests where idempotency cannot be guaranteed.
/// </summary>
public static Action<ResiliencePipelineBuilder<HttpResponseMessage>>
WritePipeline(string serviceName, TimeSpan attemptTimeout) => pipeline =>
{
pipeline
// Single timeout - with no retry, the attempt timeout is the total timeout
.AddTimeout(new TimeoutStrategyOptions { Timeout = attemptTimeout })
// Circuit breaker - still needed to fail fast when the service is down
.AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
{
FailureRatio = 0.5,
SamplingDuration = TimeSpan.FromSeconds(30),
MinimumThroughput = 3,
BreakDuration = TimeSpan.FromSeconds(20),
ShouldHandle = args => ValueTask.FromResult(
ShouldHandle(args.Outcome))
});
};
// Which outcomes warrant a retry
private static bool ShouldRetry(Outcome<HttpResponseMessage> outcome)
{
if (outcome.Exception is HttpRequestException or TaskCanceledException)
return true;
if (outcome.Result is { } response)
return response.StatusCode is
HttpStatusCode.RequestTimeout or // 408
HttpStatusCode.TooManyRequests or // 429
HttpStatusCode.InternalServerError or // 500
HttpStatusCode.BadGateway or // 502
HttpStatusCode.ServiceUnavailable or // 503
HttpStatusCode.GatewayTimeout; // 504
return false;
}
// Which outcomes the circuit breaker counts as failures
private static bool ShouldHandle(Outcome<HttpResponseMessage> outcome)
{
if (outcome.Exception is not null) return true;
if (outcome.Result is { } response)
return (int)response.StatusCode >= 500;
return false;
}
// Structured log messages - static for performance
private static class Log
{
public static void TotalTimeout(ILogger? logger, string service, TimeSpan timeout) =>
logger?.LogWarning(
"Total timeout exceeded for {Service}. Timeout: {Timeout}ms",
service, timeout.TotalMilliseconds);
public static void Retrying(ILogger? logger, string service,
int attempt, TimeSpan delay) =>
logger?.LogWarning(
"Retrying {Service}. Attempt: {Attempt}, Delay: {DelayMs}ms",
service, attempt, delay.TotalMilliseconds);
public static void CircuitOpened(ILogger? logger, string service,
TimeSpan breakDuration) =>
logger?.LogError(
"Circuit breaker OPENED for {Service}. " +
"Break duration: {BreakDuration}s. Upstream calls suspended.",
service, breakDuration.TotalSeconds);
public static void CircuitClosed(ILogger? logger, string service) =>
logger?.LogInformation(
"Circuit breaker CLOSED for {Service}. Upstream calls resumed.", service);
public static void CircuitHalfOpened(ILogger? logger, string service) =>
logger?.LogInformation(
"Circuit breaker HALF-OPEN for {Service}. Probing upstream.", service);
}
}
Wiring per-client pipelines in Program.cs
Each upstream client receives a pipeline tuned to its characteristics. The latency budget for each service was derived from the p95 response times observed in Application Insights during the first month of production operation:
// Program.cs
// User Service - small, fast lookups; tight timeout; retryable
builder.Services
.AddHttpClient<UserServiceClient>(client =>
client.BaseAddress = new Uri(config["Services:UserService:BaseUrl"]!))
.AddResilienceHandler("user-service",
ResiliencePipelineFactory.ReadPipeline(
serviceName: "UserService",
attemptTimeout: TimeSpan.FromMilliseconds(400),
totalTimeout: TimeSpan.FromMilliseconds(1200)))
.AddHttpMessageHandler<FeideTokenHandler>();
// Course Service - dataset can be larger; slightly longer timeout
builder.Services
.AddHttpClient<CourseServiceClient>(client =>
client.BaseAddress = new Uri(config["Services:CourseService:BaseUrl"]!))
.AddResilienceHandler("course-service",
ResiliencePipelineFactory.ReadPipeline(
serviceName: "CourseService",
attemptTimeout: TimeSpan.FromMilliseconds(600),
totalTimeout: TimeSpan.FromMilliseconds(2000)))
.AddHttpMessageHandler<FeideTokenHandler>();
// Session Service - potentially heavier queries; more generous total timeout
builder.Services
.AddHttpClient<SessionServiceClient>(client =>
client.BaseAddress = new Uri(config["Services:SessionService:BaseUrl"]!))
.AddResilienceHandler("session-service",
ResiliencePipelineFactory.ReadPipeline(
serviceName: "SessionService",
attemptTimeout: TimeSpan.FromMilliseconds(800),
totalTimeout: TimeSpan.FromMilliseconds(2500)))
.AddHttpMessageHandler<FeideTokenHandler>();
// Notification Service - low criticality; tight budget
builder.Services
.AddHttpClient<NotificationServiceClient>(client =>
client.BaseAddress = new Uri(config["Services:NotificationService:BaseUrl"]!))
.AddResilienceHandler("notification-service",
ResiliencePipelineFactory.ReadPipeline(
serviceName: "NotificationService",
attemptTimeout: TimeSpan.FromMilliseconds(300),
totalTimeout: TimeSpan.FromMilliseconds(900)))
.AddHttpMessageHandler<FeideTokenHandler>();
// Enrollment - write operation; no retry; circuit breaker only
builder.Services
.AddHttpClient<EnrollmentServiceClient>(client =>
client.BaseAddress = new Uri(config["Services:CourseService:BaseUrl"]!))
.AddResilienceHandler("enrollment-write",
ResiliencePipelineFactory.WritePipeline(
serviceName: "EnrollmentService",
attemptTimeout: TimeSpan.FromSeconds(5)))
.AddHttpMessageHandler<FeideTokenHandler>();
The notification service has the tightest budget (300ms per attempt, 900ms total) because it is the least critical upstream in the aggregation. If notifications are slow, the partial failure path (from Article 4) handles the absence gracefully. Spending 2.5 seconds waiting for a notification count is a worse user experience than returning a count of zero with a partialFailures: ["notifications"] marker.
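These budgets can be sanity-checked with a little arithmetic. The sketch below (plain C#, no Polly required) computes the worst case a read pipeline can reach when every attempt hits the per-attempt timeout and every exponential backoff delay (200ms base, doubling, jitter ignored) is taken in full:

```csharp
// Worst-case duration of a read pipeline if every attempt times out and
// every exponential backoff delay is taken in full. Jitter can stretch the
// delays further, so this is a floor on the worst case, not a ceiling.
static double WorstCaseMs(double attemptTimeoutMs, int maxRetries, double baseDelayMs)
{
    double total = attemptTimeoutMs;              // first attempt
    for (int i = 0; i < maxRetries; i++)
        total += baseDelayMs * Math.Pow(2, i)     // backoff before retry i + 1
               + attemptTimeoutMs;                // the retried attempt itself
    return total;
}

Console.WriteLine(WorstCaseMs(400, 2, 200)); // UserService: 1800, capped by its 1200ms total timeout
Console.WriteLine(WorstCaseMs(300, 2, 200)); // NotificationService: 1500, capped at 900ms
```

In both cases the uncapped worst case exceeds the configured total timeout, which is the point: the total timeout, not the retry count, is what bounds the user-facing latency.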
Handling resilience exceptions in typed clients
The resilience pipeline throws specific exceptions when it exhausts its strategies. The typed clients must catch these and return null so the aggregator's partial failure logic can handle them:
// Clients/CourseServiceClient.cs
public sealed class CourseServiceClient(
HttpClient http,
IHttpContextAccessor contextAccessor,
ILogger<CourseServiceClient> logger)
{
public async Task<IReadOnlyList<CourseDto>?> GetCoursesByOrgAsync(
string orgId, CancellationToken ct = default)
{
try
{
var request = new HttpRequestMessage(
HttpMethod.Get, $"courses?orgId={orgId}");
AttachCorrelationId(request);
var response = await http.SendAsync(request, ct);
response.EnsureSuccessStatusCode();
return await response.Content
.ReadFromJsonAsync<IReadOnlyList<CourseDto>>(ct);
}
catch (BrokenCircuitException ex)
{
// Circuit is open - upstream is known-bad, skip immediately
logger.LogWarning(
"Circuit open for CourseService. Skipping upstream call. " +
"OrgId: {OrgId}. Message: {Message}", orgId, ex.Message);
return null;
}
catch (TimeoutRejectedException)
{
    // Total timeout exhausted - upstream is too slow
    logger.LogWarning(
        "Timeout exhausted for CourseService. OrgId: {OrgId}", orgId);
    return null;
}
catch (HttpRequestException ex)
{
logger.LogWarning(ex,
"HTTP error from CourseService. OrgId: {OrgId}. Status: {Status}",
orgId, ex.StatusCode);
return null;
}
catch (OperationCanceledException) when (!ct.IsCancellationRequested)
{
// Cancelled by a timeout inside the resilience pipeline, not by the caller
logger.LogWarning(
"CourseService call cancelled by attempt timeout. OrgId: {OrgId}", orgId);
return null;
}
}
private void AttachCorrelationId(HttpRequestMessage request)
{
var correlationId = contextAccessor.HttpContext?
.Response.Headers["X-Correlation-Id"].FirstOrDefault();
if (correlationId is not null)
request.Headers.TryAddWithoutValidation("X-Correlation-Id", correlationId);
}
}
BrokenCircuitException is the most important case. When the circuit is open, Polly throws this exception immediately; no upstream call is made. The client catches it and returns null, which the aggregator records as a partial failure. A screen that would have waited 2.5 seconds for a timing-out upstream now fails in microseconds. This is the circuit breaker's primary value: fail fast rather than fail slow.
The aggregator: partial failure as a first-class outcome
The aggregator receives null from clients whose upstream calls failed, regardless of which resilience strategy triggered the failure. The distinction between a BrokenCircuitException and a TimeoutRejectedException is logged at the client level; the aggregator only sees the null result and decides what to do with it.
// Aggregators/DashboardAggregator.cs
public async Task<DashboardResponse> AggregateAsync(
string userId, CancellationToken ct = default)
{
var partialFailures = new List<string>();
// Phase 1: parallel - both can fail independently
var profileTask = _userClient.GetProfileAsync(userId, ct);
var notificationTask = _notificationClient.GetUnreadCountAsync(userId, ct);
await Task.WhenAll(profileTask, notificationTask);
var profile = profileTask.Result;
// Profile is required - its absence is not a partial failure, it is a hard stop
if (profile is null)
throw new BffAggregationException("User profile service unavailable.");
// Notification is optional - absence is gracefully degraded
var notificationCount = notificationTask.Result ?? 0;
if (notificationTask.Result is null)
partialFailures.Add("notifications");
// Phase 2: courses - absence degrades but does not fail the response
var courses = await _courseClient.GetCoursesByOrgAsync(profile.OrgId, ct);
if (courses is null)
partialFailures.Add("courses");
// Phase 3: sessions - only attempted if courses succeeded
IReadOnlyList<SessionDto>? sessions = null;
if (courses is { Count: > 0 })
{
sessions = await _sessionClient.GetUpcomingAsync(
courses.Select(c => c.Id).ToArray(), 3, ct);
if (sessions is null)
partialFailures.Add("sessions");
}
return new DashboardResponse(
User: ShapeUserProfile(profile),
Courses: courses?.Select(ShapeCourse).ToList() ?? [],
UpcomingSessions: sessions?.Select(ShapeSession).ToList() ?? [],
Notifications: new NotificationSummary(notificationCount),
PartialFailures: partialFailures
);
}
The aggregator does not know whether courses is null because the circuit breaker opened, because a retry was exhausted, or because the service returned a 500. That distinction belongs in the client's log entry, which carries the correlation ID. The aggregator concerns itself only with the outcome: data was available or it was not.
The retry problem: when not to retry
The ShouldRetry predicate above deliberately excludes certain status codes. This is the most consequential decision in retry configuration.
Do not retry 4xx errors (except 408 and 429). A 400 Bad Request means the request itself is malformed; retrying the same request will produce the same 400. A 401 or 403 means the caller is not authorised; retrying will not change that. A 404 means the resource does not exist; retrying will not create it. The only 4xx codes worth retrying are 408 (request timeout, which may have been a transient infrastructure issue) and 429 (too many requests, which should be retried after the delay in the Retry-After header).
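Honouring Retry-After means overriding the computed backoff for that one case. A hedged sketch using Polly 8's DelayGenerator hook; the 429-only ShouldHandle here is deliberately narrowed for illustration and is not the full retry predicate from the factory above:

```csharp
using System.Net;
using Polly;
using Polly.Retry;

var options = new RetryStrategyOptions<HttpResponseMessage>
{
    MaxRetryAttempts = 2,
    Delay = TimeSpan.FromMilliseconds(200),
    BackoffType = DelayBackoffType.Exponential,
    UseJitter = true,
    // Illustration only: handle 429 alone so the DelayGenerator's effect is clear.
    ShouldHandle = args => ValueTask.FromResult(
        args.Outcome.Result?.StatusCode is HttpStatusCode.TooManyRequests),
    DelayGenerator = args =>
    {
        // Prefer the server's Retry-After delta over our own backoff.
        // Returning null tells Polly to fall back to Delay/BackoffType.
        TimeSpan? retryAfter = args.Outcome.Result?.Headers.RetryAfter?.Delta;
        return ValueTask.FromResult(retryAfter);
    }
};
```

A Retry-After given as an absolute date rather than a delta would need additional handling; this sketch covers only the delta form.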
Do not retry non-idempotent operations. The write pipeline above has no retry. A POST to /courses/{id}/enrollment that creates an enrollment and then returns a 500 due to a response serialisation error has still created the enrollment. Retrying creates a duplicate. Either the service must be designed to be idempotent, or the client must not retry.
In the production system, this distinction caused one production incident before the write pipeline was separated. A retry on a 500 from the enrollment service, which had successfully created the enrollment before encountering a downstream notification error, created duplicate enrollments for four students. The fix was the dedicated write pipeline with no retry and explicit idempotency keys added to the enrollment POST.
Idempotency keys for safe write retries
If retrying writes is genuinely required, idempotency keys are the mechanism. The BFF generates a key for the operation, sends it with the request, and the upstream service uses it to deduplicate:
// Clients/EnrollmentServiceClient.cs
public async Task<EnrollmentResultDto?> EnrollAsync(
string courseId, string userId,
string idempotencyKey, // Caller-provided - generated once per user action
CancellationToken ct = default)
{
var request = new HttpRequestMessage(
HttpMethod.Post, $"courses/{courseId}/enrollments");
request.Headers.TryAddWithoutValidation("Idempotency-Key", idempotencyKey);
request.Content = JsonContent.Create(new { UserId = userId });
// With idempotency key, retry is safe - upstream will deduplicate
var response = await http.SendAsync(request, ct);
response.EnsureSuccessStatusCode();
return await response.Content.ReadFromJsonAsync<EnrollmentResultDto>(ct);
}
The idempotency key is generated in the BFF endpoint from the user ID and the course ID, making it stable for the same logical operation regardless of how many times it is submitted:
// Endpoints/EnrollmentEndpoints.cs
private static async Task<IResult> EnrollAsync(
string courseId, HttpContext ctx,
EnrollmentServiceClient enrollmentClient,
EnrollmentCache enrollmentCache,
CancellationToken ct)
{
var userId = ctx.User.FindFirstValue(ClaimTypes.NameIdentifier)!;
// Deterministic key - same user + course + UTC date always produces the same key
var idempotencyKey = Convert.ToHexString(
SHA256.HashData(
Encoding.UTF8.GetBytes($"{userId}:{courseId}:{DateTime.UtcNow:yyyy-MM-dd}")));
var result = await enrollmentClient.EnrollAsync(courseId, userId, idempotencyKey, ct);
if (result is null)
return Results.Problem(
detail: "Enrollment could not be processed.",
statusCode: StatusCodes.Status502BadGateway);
await enrollmentCache.InvalidateAsync(userId, ct);
return Results.Ok(result);
}
The date component in the key ensures that the same enrollment attempt on different days produces different keys, which is correct, since a student might legitimately withdraw and re-enroll in the same course across days.
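The determinism is easy to demonstrate in isolation. A small sketch using the same hashing scheme as the endpoint above, with the date made an explicit parameter so the behaviour across days is visible (the user and course IDs are made up for illustration):

```csharp
using System.Security.Cryptography;
using System.Text;

// Same scheme as the enrollment endpoint: SHA-256 over "user:course:date".
static string MakeIdempotencyKey(string userId, string courseId, DateOnly utcDate) =>
    Convert.ToHexString(
        SHA256.HashData(
            Encoding.UTF8.GetBytes($"{userId}:{courseId}:{utcDate:yyyy-MM-dd}")));

var day = new DateOnly(2024, 5, 17);
var first   = MakeIdempotencyKey("user-42", "course-7", day);
var second  = MakeIdempotencyKey("user-42", "course-7", day);
var nextDay = MakeIdempotencyKey("user-42", "course-7", day.AddDays(1));

Console.WriteLine(first == second);  // True: resubmitting the same action reuses the key
Console.WriteLine(first == nextDay); // False: a new day is a new logical operation
```

The upstream service can therefore deduplicate on the key alone, without inspecting the request body.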
Bulkhead isolation: containing failure domains
A bulkhead limits the number of concurrent calls to a specific upstream service. Without bulkheads, a slow upstream service can exhaust the BFF's thread pool β every available thread is waiting on that upstream, and requests to other upstreams queue behind them.
Bulkhead support in Microsoft.Extensions.Http.Resilience is provided through the AddConcurrencyLimiter extension:
// For services that are particularly prone to slowdowns under load
builder.Services
.AddHttpClient<SessionServiceClient>(client =>
client.BaseAddress = new Uri(config["Services:SessionService:BaseUrl"]!))
.AddResilienceHandler("session-service", pipeline =>
{
// Add bulkhead before the read pipeline strategies
pipeline.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
{
PermitLimit = 20, // Max 20 concurrent calls to SessionService
QueueLimit = 5 // Queue up to 5 more - reject beyond that
});
// Then the standard read pipeline strategies
ResiliencePipelineFactory.ReadPipeline(
"SessionService",
TimeSpan.FromMilliseconds(800),
TimeSpan.FromMilliseconds(2500))(pipeline);
})
.AddHttpMessageHandler<FeideTokenHandler>();
When the session service is slow and 20 concurrent BFF requests are already waiting on it, the 21st through 25th requests queue. The 26th is rejected immediately with a RateLimiterRejectedException, which the client catches and returns as null: a partial failure for sessions, not a hard error. The user profile and course data still load; only upcoming sessions are absent.
Without the bulkhead, the 26th request would add another waiting thread. At sufficient load, the BFF's thread pool is exhausted by session service calls, and requests to the user service, which might be perfectly healthy, cannot execute. The bulkhead contains the blast radius.
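The client-side handling this relies on follows the same pattern as the other catch blocks. A sketch, with the assumption that the method shape and route mirror the aggregator's GetUpcomingAsync call and that SessionDto comes from the series' shared contracts:

```csharp
using System.Net.Http.Json;
using Microsoft.Extensions.Logging;
using Polly.RateLimiting;

public sealed class SessionServiceClient(
    HttpClient http,
    ILogger<SessionServiceClient> logger)
{
    public async Task<IReadOnlyList<SessionDto>?> GetUpcomingAsync(
        string[] courseIds, int limit, CancellationToken ct = default)
    {
        try
        {
            var ids = string.Join(",", courseIds);
            var response = await http.GetAsync(
                $"sessions/upcoming?courseIds={ids}&limit={limit}", ct);
            response.EnsureSuccessStatusCode();
            return await response.Content
                .ReadFromJsonAsync<IReadOnlyList<SessionDto>>(ct);
        }
        catch (RateLimiterRejectedException)
        {
            // Bulkhead full: rejected before any upstream call was made.
            logger.LogWarning(
                "SessionService bulkhead rejected the request. Returning partial failure.");
            return null;
        }
    }
}
```

The null return feeds the same aggregator path as a timeout or an open circuit, so the bulkhead needs no special handling downstream.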
Testing resilience behaviour
Resilience strategies are only trustworthy if they are tested. The integration test factory from Article 8 provides the mechanism β configure the substitute to throw the exceptions that the resilience layer would throw, and verify the aggregator's response.
// EducationPlatform.Bff.IntegrationTests/Resilience/CircuitBreakerTests.cs
public class CircuitBreakerTests(BffWebApplicationFactory factory)
: IClassFixture<BffWebApplicationFactory>
{
[Fact]
public async Task Dashboard_CourseServiceCircuitOpen_ReturnsDegradedResponse()
{
// Arrange - profile and notifications available; courses circuit open
factory.UserClient
.GetProfileAsync(Arg.Any<string>(), Arg.Any<CancellationToken>())
.Returns(new UserProfileDto("Ingrid", "Solberg", "uninett", "TEACHER", null));
factory.NotificationClient
.GetUnreadCountAsync(Arg.Any<string>(), Arg.Any<CancellationToken>())
.Returns(0);
// Simulate the client returning null (as it would after catching BrokenCircuitException)
factory.CourseClient
.GetCoursesByOrgAsync(Arg.Any<string>(), Arg.Any<CancellationToken>())
.Returns((IReadOnlyList<CourseDto>?)null);
var client = factory.CreateAuthenticatedClient();
// Act
var response = await client.GetAsync("/api/dashboard");
// Assert - 200 with partial failure, not a 503
response.StatusCode.Should().Be(HttpStatusCode.OK);
var body = await response.Content.ReadFromJsonAsync<DashboardResponse>();
body!.Courses.Should().BeEmpty();
body.PartialFailures.Should().Contain("courses");
body.User.DisplayName.Should().Be("Ingrid Solberg"); // User data unaffected
}
[Fact]
public async Task Dashboard_AllNonCriticalServicesUnavailable_ReturnsMinimalResponse()
{
// Arrange - only profile available
factory.UserClient
.GetProfileAsync(Arg.Any<string>(), Arg.Any<CancellationToken>())
.Returns(new UserProfileDto("Ingrid", "Solberg", "uninett", "TEACHER", null));
factory.NotificationClient
.GetUnreadCountAsync(Arg.Any<string>(), Arg.Any<CancellationToken>())
.Returns((int?)null);
factory.CourseClient
.GetCoursesByOrgAsync(Arg.Any<string>(), Arg.Any<CancellationToken>())
.Returns((IReadOnlyList<CourseDto>?)null);
var client = factory.CreateAuthenticatedClient();
// Act
var response = await client.GetAsync("/api/dashboard");
// Assert - still a valid response, just minimally populated
response.StatusCode.Should().Be(HttpStatusCode.OK);
var body = await response.Content.ReadFromJsonAsync<DashboardResponse>();
body!.User.Should().NotBeNull(); // The one thing that always works
body.Courses.Should().BeEmpty();
body.UpcomingSessions.Should().BeEmpty();
body.Notifications.Count.Should().Be(0);
body.PartialFailures.Should().HaveCount(2)
.And.Contain("courses")
.And.Contain("notifications");
}
[Fact]
public async Task Dashboard_ProfileServiceUnavailable_Returns503()
{
// Profile is required - its absence is a hard failure, not a partial one
factory.UserClient
.GetProfileAsync(Arg.Any<string>(), Arg.Any<CancellationToken>())
.Returns((UserProfileDto?)null);
factory.NotificationClient
.GetUnreadCountAsync(Arg.Any<string>(), Arg.Any<CancellationToken>())
.Returns(0);
var client = factory.CreateAuthenticatedClient();
var response = await client.GetAsync("/api/dashboard");
response.StatusCode.Should().Be(HttpStatusCode.ServiceUnavailable);
}
}
These tests verify the aggregator's partial failure behaviour under resilience-layer outcomes. They test the outcomes β what the Vue application receives β not the Polly strategies themselves. Testing Polly's internal behaviour is Polly's job; testing that your aggregator responds correctly to the signals Polly produces is yours.
Observing resilience behaviour in production
Resilience events (retries, timeouts, circuit breaker state changes) must be visible in Application Insights. The OnRetry, OnOpened, OnClosed, and OnHalfOpened callbacks in the pipeline configuration (shown above) emit structured log entries that Serilog writes to Application Insights.
A KQL query that surfaces retry activity:
traces
| where timestamp > ago(1h)
| where message contains "Retrying"
| extend
Service = tostring(customDimensions["Service"]),
Attempt = toint(customDimensions["Attempt"])
| summarize RetryCount = count() by Service, bin(timestamp, 5m)
| render timechart
And circuit breaker openings:
traces
| where timestamp > ago(24h)
| where message contains "Circuit breaker OPENED"
| extend Service = tostring(customDimensions["Service"])
| project timestamp, Service, message
| order by timestamp desc
In the production system, this query was part of the daily operational review. A circuit breaker opening is always a signal worth investigating: it means an upstream service sustained a failure ratio of at least 50% across a minimum of 5 requests within a 30-second sampling window. That is not a transient blip; it is an upstream service in trouble. The circuit breaker opening is often the first observable signal of an upstream incident, arriving before alerts from the upstream team's own monitoring.
The settings that required tuning in production
The initial resilience configuration used the standard handler defaults for all services. Three settings were tuned after observing production behaviour:
Attempt timeout for the notification service was lowered from 1 second to 300ms. The notification service was consistently the slowest upstream at p95 (280ms). With a 1-second attempt timeout and two retries, a slow notification call could hold a BFF aggregation for up to 3 seconds before returning null. At 300ms, slow attempts time out quickly, the retries fire, and the 900ms total timeout caps the whole operation, which is within the acceptable budget for a non-critical service.
Circuit breaker MinimumThroughput was raised from 3 to 5 for the user service. At 3 requests, a brief burst of three 500 errors, which occurred during a weekly maintenance window on the user service, opened the circuit breaker and blocked all dashboard loads for 20 seconds. Five requests provide a more stable signal that distinguishes a sustained failure from a brief transient event.
Retry jitter was essential under load. The initial configuration used linear backoff without jitter. During a load test, all requests to the course service that encountered a 503 retried at exactly 200ms and then 400ms intervals, producing a coordinated retry storm that overwhelmed the course service's recovery. Adding UseJitter = true spread the retries across the delay window and eliminated the storm pattern.
Resilience as a design conversation, not a configuration detail
The configuration decisions above (which services get retries, what the timeout budgets are, where the circuit breaker thresholds sit) are not technical decisions made in isolation. They represent a negotiation between what the BFF can tolerate, what the upstream services can withstand, and what the user experience requires.
A retry that is safe from the BFF's perspective may be harmful from the upstream's perspective if it doubles the load during a partial outage. A timeout that is acceptable for a background operation is unacceptable for a user-facing request on the critical path. A circuit breaker threshold that is appropriate for a high-traffic service is too sensitive for a low-traffic service where three failures in 30 seconds is statistically insignificant.
These thresholds should be reviewed with the teams that own the upstream services, set against measured p95 latencies from production telemetry, and revisited when service characteristics change. A resilience configuration that has not been reviewed in six months is a configuration that no longer reflects the system it is protecting.