A blue/green race condition in Azure Front Door
Front Door route updates aren't atomic, and propagation can take 20-45 minutes. Here's how that poisoned our edge cache during a blue/green deploy and the fix that makes the race structurally impossible.
The point of versioned static assets is that the version is the cache key. app.a4f9.css is supposed to mean exactly one thing forever. Change the file, change the hash, change the URL. Caches stop being a problem.
That's the theory. In practice, on a portfolio of 25+ app services and functions fronted by Azure Front Door Premium with a blue/green app service pair model, we shipped a release and a slice of users started seeing pages with the right HTML and the wrong CSS. Mismatched layout. Broken components. Same versioned CSS URL, different bytes depending on who you asked.
This is the story of what was actually happening, why versioned filenames don't save you here, and the fix.
The setup
Standard enterprise .NET shop. The relevant pieces:
ASP.NET Core apps hosted on Azure App Services
Two app services per workload: a "blue" set and a "green" set, not deployment slots within a single app service
Two origin groups in Front Door: one pointing at the blue app services, one at the green
A batch of Front Door routes that includes a traffic route (which decides which origin group serves users) and a cache route (which applies caching policy to static assets)
ASP.NET Core build pipeline emits content-hashed filenames for CSS/JS bundles
Deploy flow: deploy to the inactive set (say green), warm it up, run a Front Door route update batch that flips the traffic route to green and updates the cache route to match. Done.
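The content-hash scheme the pipeline relies on can be sketched in a few lines. This is Python for illustration only - the real pipeline is the ASP.NET Core bundler - and the 8-character digest length is an arbitrary choice:

```python
import hashlib

def hashed_name(filename: str, content: bytes, hash_len: int = 8) -> str:
    """Derive a content-addressed filename like app.a4f9c2d1.css."""
    digest = hashlib.sha256(content).hexdigest()[:hash_len]
    stem, _, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}"

# Same bytes -> same URL; different bytes -> different URL. This is why
# downstream caches normally never need explicit invalidation.
old = hashed_name("app.css", b"body { color: red }")
new = hashed_name("app.css", b"body { color: blue }")
assert old != new
```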
The model is simple. The active set serves traffic. We deploy to the inactive set. We update Front Door routes to flip which set is active. The two sets are completely separate App Service resources, so the deploy itself is risk-free - the new code isn't reachable until the route update lands.
That's the model. The reality has more moving parts than the model accounts for.
The symptom
Reports trickled in over about 30 minutes post-deploy: a subset of users seeing the new HTML rendered with what looked like the previous build's stylesheet. Not stale CSS - wrong CSS. The browser was loading app.{newhash}.css and getting bytes that didn't match the stylesheet the build had produced for that hash.
The first instinct is always "browser cache." It wasn't. Fresh sessions, incognito, different geographies - same problem. The bad response was cached upstream of the user.
What's actually happening
Front Door route updates aren't atomic. When you push a batch of route changes - even a single API call updating multiple rules - those changes propagate across AFD's global edge fleet over a window measured in 20 to 45 minutes. That's not a typo. Eventual consistency on deployment-critical routing config, on the order of half an hour, is the operating reality of a globally distributed CDN. Critically, the rules in the batch don't all flip at the same instant on a given POP. One rule's update can land before another's.
Now overlay the blue/green flip. The deploy involves a batch of route changes:
The traffic route, which decides which origin group (blue or green) handles incoming requests
The cache route, which controls caching behavior for static assets
Both are being updated to point at the new active set. Both are in the same batch. They are not guaranteed to update together at any given POP.
Here's the race:
The route update batch is published.
At a given POP, mid-propagation, the traffic route has already flipped to the new set (green) but the cache route is still scoped to the old set (blue).
A user lands on that POP. The new HTML is served - it references app.{newhash}.css, the new build's asset.
The browser requests app.{newhash}.css. The cache route (still pointing at blue) handles the request and forwards it to the old origin group.
The old app services don't have app.{newhash}.css. What they return depends on the static file pipeline - a 404, a fallback bundle, default content - whatever it is, it's the wrong response.
That wrong response gets cached at the POP, keyed by the new hash URL.
Any user routed through that POP for that asset URL now gets the cached wrong response. For the lifetime of the cache TTL.
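The race reproduces with a toy model of a POP. This is a sketch, not AFD's internals: filenames and contents are invented, and the old origin is assumed to answer an unknown hash with a cacheable 200 fallback bundle (one of the failure modes above):

```python
# Toy POP: the traffic route serves HTML, the cache route serves assets,
# and an edge cache stores asset responses keyed by URL.
blue_origin  = {"/static/app.old123.css": "old css"}   # old build
green_origin = {"/static/app.new456.css": "new css"}   # new build

edge_cache = {}

def fetch_asset(url, cache_route_origin):
    """Asset request through the cache route; responses are cached by URL."""
    if url not in edge_cache:
        # The old static pipeline answers an unknown hash with a fallback
        # bundle - a 200 response, so it's cacheable.
        edge_cache[url] = cache_route_origin.get(url, "fallback: old css")
    return edge_cache[url]

# Mid-propagation: the traffic route has flipped, so the HTML references
# the new hash - but the cache route still forwards to blue.
wrong = fetch_asset("/static/app.new456.css", blue_origin)
assert wrong != "new css"

# The cache route catches up, but the wrong bytes are already cached
# under the new, supposedly immutable URL - for the lifetime of the TTL.
still_wrong = fetch_asset("/static/app.new456.css", green_origin)
assert still_wrong != "new css"
```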
The versioned filename didn't save us because the version isn't unique to an origin. Both origin groups can be asked for the same versioned URL, and they'll answer differently - or one of them won't have it at all and will answer with something worse than nothing.
Why this isn't obvious
The mental model that makes versioned assets feel safe is: content hash means same URL implies same bytes, forever. That's true at the build artifact layer. It's not true at the routing layer when two origins with different artifacts can both be reached by the same URL during a transition window.
The other implicit assumption is that route updates are atomic. That assumption is reasonable for a single load balancer with a single routing table. It's wrong for Front Door, which is a globally distributed CDN with eventual consistency on rule propagation. The docs don't make this loud, but it's structural to how the product works - global propagation is a feature, not a bug, and it's the price of having edge POPs everywhere. The 20–45 minute window isn't a bug to fix; it's the contract.
Most blue/green guides skip this because they assume one of:
No CDN caching (so propagation isn't observable to the cache)
Atomic rule updates (true for some load balancers, not for AFD)
Single-origin deployments (where there's nothing to misroute to)
When you have all three, the race window doesn't matter. When you have AFD Premium with caching, multi-origin blue/green, and a multi-rule update batch, it does.
The fix: set-prefix cache key disambiguation
The shape of the fix is to make the cache key - not the artifact path on origin - unambiguous about which set the asset belongs to.
Each app service, at runtime, prefixes the asset URLs it emits in HTML with its own set identity:
HTML served from the blue set references /static/blue/app.{hash}.css
HTML served from the green set references /static/green/app.{hash}.css
Front Door is configured to strip the set prefix at the edge before forwarding the request to origin. Origins never see prefixed paths. They serve /static/app.{hash}.css exactly as they always have.
So the prefix exists for one purpose: to force Front Door to treat blue's and green's identically-hashed assets as distinct cache entries. The cache key includes the prefix; the origin request doesn't.
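The edge-side contract is small enough to state as code: the cache key is the full incoming path, and only the origin-bound request loses the prefix. A sketch of that split - function names are mine, and in AFD itself this lives in a rule set as a URL rewrite action, not application code:

```python
SET_PREFIXES = ("blue", "green")

def origin_path(edge_path: str) -> str:
    """Strip the set prefix before forwarding to origin.

    /static/green/app.a4f9.css -> /static/app.a4f9.css
    """
    parts = edge_path.split("/")
    # Expected shape: ['', 'static', '<set>', '<file>']
    if len(parts) >= 4 and parts[1] == "static" and parts[2] in SET_PREFIXES:
        return "/".join(parts[:2] + parts[3:])
    return edge_path  # non-prefixed paths pass through untouched

def cache_key(edge_path: str) -> str:
    """The cache key keeps the prefix, so blue/green entries never collide."""
    return edge_path

# Both sets resolve to the same origin path but distinct cache keys.
assert origin_path("/static/blue/app.a4f9.css") == origin_path("/static/green/app.a4f9.css")
assert cache_key("/static/blue/app.a4f9.css") != cache_key("/static/green/app.a4f9.css")
```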
Trace through the race with this in place:
Route update batch published. At some POP, traffic route flipped to green, cache route still scoped to blue.
User hits the POP. New HTML served from green references /static/green/app.{newhash}.css.
Browser requests /static/green/app.{newhash}.css. The cache route (lagging, still scoped to blue) handles the request.
AFD's cache key for this request is /static/green/app.{newhash}.css - a key that has never been polluted, because no blue-served HTML has ever referenced that path.
AFD strips the green prefix and forwards /static/app.{newhash}.css to the origin group the cache route currently points to (blue).
The blue origin doesn't have that asset. It returns whatever it returns for missing files.
AFD caches that response - but against the cache key /static/green/app.{newhash}.css.
Once the cache route catches up and points to green, the next request for that same URL hits green, gets the right CSS, and from that point forward the cache serves correctly.
The cache key collision that caused the bug doesn't happen. The two sets' cache entries occupy different namespaces. The race can still fire on the way in, but it can't poison entries that real users will hit and it can't pollute green's cache with blue's content (or vice versa) because the prefix forces them apart at the cache layer.
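Tracing this with a toy POP model shows the namespaces staying apart. Contents are invented; the sketch also assumes the wrong-origin miss is a plain 404 and models Front Door's documented behavior of caching only successful (200) responses:

```python
# Toy POP with set-prefixed cache keys. Origins store (status, body).
blue_origin  = {"/static/app.old123.css": (200, "old css")}
green_origin = {"/static/app.new456.css": (200, "new css")}
edge_cache = {}

def fetch(edge_url, cache_route_origin):
    """Cache keyed on the prefixed URL; origin sees the stripped path."""
    if edge_url in edge_cache:
        return edge_cache[edge_url]
    stripped = edge_url.replace("/green/", "/", 1).replace("/blue/", "/", 1)
    status, body = cache_route_origin.get(stripped, (404, "not found"))
    if status == 200:                 # only successful responses are cached
        edge_cache[edge_url] = (status, body)
    return (status, body)

# Mid-propagation: green HTML references the green-prefixed URL, but the
# lagging cache route forwards to blue. A miss - ugly, but not cached.
assert fetch("/static/green/app.new456.css", blue_origin) == (404, "not found")

# Cache route catches up: the same key now resolves to green, and from
# here on the cache serves the correct bytes.
assert fetch("/static/green/app.new456.css", green_origin) == (200, "new css")
```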
A few practical notes:
The prefix is in the URL path, not a query string. AFD's default cache key behavior on query strings varies; path is unambiguous.
No origin-side static file middleware changes are needed. The prefix is stripped before requests reach origin. The fix is a runtime URL emission change in the app plus an AFD rule to strip the prefix.
Determining the runtime set identity in each app service is the implementation detail to get right. App settings, environment variables, or a startup probe against a known endpoint all work. Whatever the source, it has to be reliable on cold start - getting it wrong means emitting the other set's prefix and self-poisoning your own cache.
The fix doesn't depend on Front Door's behavior changing. Even if propagation got faster or more atomic tomorrow, the namespace separation continues to protect against any future race in the same shape.
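Resolving the set identity from configuration with a fail-fast check is the cheap way to avoid self-poisoning. A sketch, assuming an environment variable named DEPLOYMENT_SET (the real app would read this through ASP.NET Core configuration; all names here are mine):

```python
import os

VALID_SETS = {"blue", "green"}

def deployment_set() -> str:
    """Resolve this instance's set identity, failing fast on cold start.

    Emitting the wrong prefix is worse than crashing: the app would
    populate the *other* set's cache namespace with its own assets.
    """
    value = os.environ.get("DEPLOYMENT_SET", "").strip().lower()
    if value not in VALID_SETS:
        raise RuntimeError(
            f"DEPLOYMENT_SET must be one of {sorted(VALID_SETS)}, got {value!r}"
        )
    return value

def asset_url(filename: str) -> str:
    """Prefix every emitted asset URL with the set identity."""
    return f"/static/{deployment_set()}/{filename}"
```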
Lessons
A few things I'm taking from this:
The bug is about cache keys, not artifacts. The intuitive fix shape is "make the artifacts namespace-separate" - give each set its own asset directory on origin, route accordingly. That works, but it's overkill. The deeper insight is that the cache key is the only place the collision actually happens. Fix it there, and you can leave origins untouched.
Versioned filenames protect against client-side cache invalidation, not against multi-origin cache poisoning. Those are different problems. The first is solved by content hashes. The second is solved by cache-key disambiguation.
Front Door route updates are eventually consistent, not atomic, and the window is measured in tens of minutes. A batch of rule changes is not a transaction. Different rules in the same batch can land at different times on the same POP, and "different times" can mean half an hour apart. Any deployment strategy that relies on multiple rules flipping together is exposed to this for the entire propagation window.
Blue/green at the edge is not blue/green at the origin. Origin-side, the swap is a route configuration change that's complete the moment the API call returns. Edge-side, propagation is observable for 20–45 minutes, and any cache between the user and the origin needs to be reasoned about during that window.
The race window will find you eventually. This bug existed for many deploys before symptoms surfaced widely enough to chase. Most deploys, propagation completed before any user happened to land on a mid-propagation POP and request a new asset URL through a lagging cache route. The fact that a bug doesn't fire often is not evidence the design is sound. It's evidence the race fires rarely on any given deploy.
Set-prefix cache disambiguation should probably be the default for AFD + multi-origin blue/green. The cost is small - runtime URL emission plus an edge rule to strip the prefix. The protection is structural. The alternative is hoping the route batch propagates uniformly before anyone notices, across a 20–45 minute window, which is not a deployment strategy.
If you're running blue/green via paired app services behind Front Door Premium, and your assets are cached at the edge, this race is in your system right now. You may or may not have noticed it. The fix isn't expensive. Worth doing before you have to write the post-mortem.

