Utkarsh M — Principal Software Engineer

Problem

A retail chain needed to push content updates to 12,000+ digital displays across 23 cities. Updates had to be atomic — no screen could show a half-rendered campaign. Connectivity was unreliable. Some screens were offline for hours.

Context

The existing system was a broadcast-based push model. A central server emitted updates. Screens received them or didn't. There was no acknowledgment protocol, no conflict resolution, and no way to know what state any given screen was in at any moment.

Constraint

No cellular upgrade budget. Existing hardware. Updates had to guarantee eventual consistency with no partial-render states. The system had to degrade gracefully: a screen that lost connectivity could never show an error or a blank display.

The decision

What we chose and why.

Each screen became a state machine, not a receiver. Screens maintained local state and requested diffs when they reconnected. Content was versioned, and packages were content-addressed: each package was identified by the hash of its contents. A screen only applied an update if its local hash matched the expected predecessor.
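
The predecessor check described above can be sketched roughly as follows. This is a minimal illustration, not the production code: the `Update` and `Screen` names, the choice of SHA-256, and the in-memory payload are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """Hypothetical update record: a payload plus the hash it must follow."""
    predecessor: str  # hash the screen is expected to hold before applying
    payload: bytes    # the new content package

    @property
    def content_hash(self) -> str:
        # Content-addressed: the identity of the package is the hash of its bytes.
        return hashlib.sha256(self.payload).hexdigest()


class Screen:
    """Hypothetical screen that only moves between fully known states."""

    def __init__(self, initial_payload: bytes) -> None:
        self.payload = initial_payload
        self.current_hash = hashlib.sha256(initial_payload).hexdigest()

    def apply(self, update: Update) -> bool:
        # Refuse the update unless it was built on top of our current state.
        if update.predecessor != self.current_hash:
            return False
        self.payload = update.payload
        self.current_hash = update.content_hash
        return True
```

A screen that has been offline and missed a version gets `False` back from `apply`, keeps showing its current content, and can then request the diff it actually needs.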

Tradeoffs

Content-addressed packages over mutable content URLs

A screen must be certain that what it downloaded is exactly what was published. With hash-based content, verification is a single comparison: hash the downloaded bytes and check the result against the published hash.
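
As a rough sketch of that verification step, assuming SHA-256 and a file-like download stream (the function names and chunk size here are illustrative, not taken from the real system):

```python
import hashlib
from typing import BinaryIO


def package_hash(stream: BinaryIO, chunk_size: int = 64 * 1024) -> str:
    """Hash a package in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
    return digest.hexdigest()


def verify_package(stream: BinaryIO, published_hash: str) -> bool:
    """True only if the downloaded bytes hash to the published identifier."""
    return package_hash(stream) == published_hash
```

Computing the hash is linear in the package size; the comparison itself is constant-time for a fixed-length digest. A mutable URL offers no equivalent check, because the same address can return different bytes over time.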

Pull-based sync over push-based broadcast

Push assumes connectivity. Pull assumes disconnection. When 20% of screens reconnect simultaneously, pull with jittered backoff distributes the load naturally.

Atomic state transitions over incremental asset loading

A screen that shows half a campaign is worse than a screen that shows yesterday's campaign. The previous state is always the safe fallback.
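
One common way to get this all-or-nothing behavior on a local filesystem is to stage the new package completely and then switch a single pointer in one atomic step. The sketch below assumes that approach; the directory layout, the `current` pointer file, and the function names are hypothetical and not taken from the real system.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, Optional


def activate(root: Path, package_hash: str, assets: Dict[str, bytes]) -> None:
    """Stage every asset first, then switch the 'current' pointer in one step."""
    staging = root / "packages" / package_hash
    staging.mkdir(parents=True, exist_ok=True)
    for name, data in assets.items():
        (staging / name).write_bytes(data)

    # Write the new pointer to a temporary file in the same directory,
    # then rename it over the old pointer. os.replace is atomic when
    # source and destination are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=root)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(package_hash)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, root / "current")


def active_package(root: Path) -> Optional[str]:
    """Return the hash of the package the screen should render, if any."""
    pointer = root / "current"
    return pointer.read_text() if pointer.exists() else None
```

If staging fails partway through, the pointer is never touched, so the renderer keeps reading the previous package. The old package directory also stays on disk, which is what makes it available as a fallback.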

Architecture

[Interactive architecture diagram not reproduced here. Node types in the legend: Service, Data Store, Queue / Bus, Client, External.]

The failure

When connectivity was restored after a regional outage, 3,000 screens reconnected within a 90-second window. Every screen requested its full sync immediately. The Sync Service received roughly 33 requests per second instead of the expected 2–3, and it crashed. All screens fell back to last-known-good state, which was the correct behavior, but the cascade took 20 minutes to resolve.

DISCOVERED — In production. During a planned maintenance window that coincided with an ISP outage.

IMPACT — No screens went dark. No partial renders. But the new campaign launched 20 minutes late across the affected region.

Iteration

We added jittered reconnection windows. Each screen now waits a random interval (5–180 seconds) after detecting connectivity before requesting sync. This is combined with exponential backoff on the sync service side. We tested the change at 10× the maximum observed reconnect rate, and the thundering herd no longer materializes.
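
The jittered window can be sketched as below. The 5–180 second range comes from the text above; everything else is an assumption for illustration. In particular, the write-up places exponential backoff on the sync service side without saying how it was implemented, so `retry_delay` shows a common client-side retry variant (capped exponential backoff with full jitter) with made-up `base_s` and `cap_s` values.

```python
import random
from typing import Dict

JITTER_MIN_S = 5.0    # from the text: lower bound of the reconnect window
JITTER_MAX_S = 180.0  # from the text: upper bound of the reconnect window


def reconnect_delay(rng: random.Random) -> float:
    """Seconds a screen waits after detecting connectivity before syncing."""
    return rng.uniform(JITTER_MIN_S, JITTER_MAX_S)


def retry_delay(attempt: int, rng: random.Random,
                base_s: float = 1.0, cap_s: float = 300.0) -> float:
    """Assumed retry policy: capped exponential backoff with full jitter."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def peak_requests_per_second(screens: int, rng: random.Random) -> int:
    """Simulate one mass reconnect and return the busiest one-second bucket."""
    buckets: Dict[int, int] = {}
    for _ in range(screens):
        second = int(reconnect_delay(rng))
        buckets[second] = buckets.get(second, 0) + 1
    return max(buckets.values())
```

Calling `peak_requests_per_second(3000, random.Random())` gives a quick feel for how the jitter window flattens a burst of simultaneous reconnects; it models only the random wait, not network or service behavior.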

Outcome

99.97% uptime. Sync latency under 3 seconds globally.

Campaigns now go live simultaneously across all 23 cities within seconds. The content team that used to manage deployment windows now schedules content months in advance and forgets about it. Maintenance cost dropped 60% because errors surface before they become incidents.

Lessons

Design distributed systems for the failure path first. The happy path is easy.

Thundering herd is not a theoretical concern. It will happen the day you least want it to.

Content-addressed storage is one of those ideas that looks complicated until you implement it, and then you never go back.

Last-known-good is not failure. It's resilience. Design fallbacks as first-class features.