Unplug and Go — Hub-Spoke Topology, Role Promotion, and Five-Minute Failover in xGrid

Any Spoke can become a Hub in minutes. A failed Hub gets replaced in five. Every Raspberry Pi ships with the full stack pre-installed — the role is just a config file. This is how xGrid handles topology changes in disconnected environments.

A Pattern You Already Know

You may not have heard the term "Hub-Spoke topology," but you use it every day.

Open any airline's route map. A few large nodes — Denver, Los Angeles, Chicago — radiate dozens of connections to smaller cities. The large nodes are Hubs. The small cities are Spokes. Instead of flying direct from every city to every other city (which would require 435 routes for 30 cities), all traffic flows through a handful of hubs. Fewer routes, more coordination, vastly more efficient.

Point-to-point (top) vs hub-and-spoke (bottom): routing through a central hub reduces connections dramatically. Source: Wikipedia (public domain)

Logistics works the same way. FedEx routes everything through Memphis. A package from Taipei to Kaohsiung might fly through Memphis first — seemingly absurd, but centralized sorting is more efficient than point-to-point relay.

The pattern is everywhere in information systems too: a central node coordinates multiple edge nodes.

But traditional Hub-Spoke has a fatal assumption: the Hub is always online. Flights can wait for the hub airport to reopen. Packages can wait for the sorting center. But in a mass casualty incident, if the Hub goes down, patients cannot wait.

xGrid's Hub-Spoke makes two critical modifications: every Spoke is a complete system, not just a terminal, and any Spoke can promote itself to a new Hub in minutes.

The Physical Layout: One Backbone, Many Satellites

xGrid's deployment is not two boxes and a cable. It is a scalable topology built around an Ethernet switch:

                 ┌──────────────────────┐
                 │  Hub A               │
                 │  CIRS + MIRS + HIRS  │
                 │  WiFi Hotspot        │
                 │  mDNS broadcast      │
                 └──────────┬───────────┘
                            │
                  ┌─────────┴─────────┐
                  │  Ethernet Switch  │
                  └──┬────┬────┬────┬─┘
                     │    │    │    │
                  ┌──┴┐┌──┴┐┌──┴┐┌──┴┐
                  │ B ││ C ││ D ││ E │  ← Spokes
                  └───┘└───┘└───┘└───┘
                  Each has its own WiFi hotspot
                  Each runs full MIRS
                  Each holds a recent CIRS snapshot

The Hub runs the resource system (CIRS), clinical system (MIRS), and home inventory (HIRS). It broadcasts a WiFi hotspot — iPads connect to it for all clinical operations. Multiple Spokes connect to the Hub through an Ethernet switch, each running MIRS to manage its own station's inventory.

The switch is the data backbone — but it is not a single point of failure. If the switch dies, every RPi still has its own WiFi hotspot. iPad-based clinical operations continue uninterrupted at every station. Only inter-station synchronization pauses.

Two separate network layers, by design: Ethernet for data sync between stations. WiFi for clinical operations within each station. One can fail without affecting the other.

Every RPi Is a Complete System — The Golden Image

This is the most consequential design decision in the entire architecture: every Raspberry Pi ships with everything pre-installed.

CIRS (resource), MIRS (clinical), HIRS (home inventory) — all present on every SD card. The role is not determined by hardware. It is determined by a single configuration file: /etc/xgrid/role.conf. Change one field and a Spoke becomes a Hub.

This means you do not stock "Hub units" and "Spoke units." You stock identical spares. Any unit can replace any other unit. A box of five Raspberry Pis in your logistics container is not five specific parts — it is five interchangeable nodes that can assume whatever role the situation demands.

In Spoke mode, CIRS stays installed but does not start — saving memory, CPU, and avoiding conflicts. But it is there, ready to wake up the moment it is needed.
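
The post does not publish the layout of /etc/xgrid/role.conf, but given the fields it describes (role, epoch, cluster ID, station name), a hypothetical sketch of such a file might look like this; every field name here is illustrative, not the real schema:

```ini
# /etc/xgrid/role.conf — hypothetical layout; field names are illustrative
role       = spoke                                   # "hub" or "spoke"
epoch      = 1                                       # incremented on every promotion
cluster_id = a1b2c3d4-0000-4000-8000-0000deadbeef    # set once at cluster creation
station    = triage-north                            # human-readable station name
hub_addr   = auto                                    # "auto" = discover the Hub via mDNS
```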

Tiered Deployment — From Forward Station to Medical Center

Not every deployment needs five machines. The topology scales to match the mission:

Echelon             Configuration       Connectivity     Concurrent Tablets
Medical Center      1 Hub + 4 Spokes    8-port switch    ~75
Regional Hospital   1 Hub + 3 Spokes    5-port switch    ~60
District Hospital   1 Hub + 1 Spoke     Direct cable     ~30
Forward Station     1 Standalone        None required    ~15

A forward station is one RPi, one power source, one iPad. That is a complete medical information system. Need to scale up? Bring another RPi, plug in an Ethernet cable, and it becomes a Spoke automatically.

The same Golden Image works at every echelon. The difference is how many units you deploy and what roles you assign them — not what software they carry.

Disconnection Is Not Failure — It Is the Expected State

In a traditional system, losing the network link triggers alerts, degrades service, and initiates recovery procedures.

In xGrid, losing the network link triggers nothing visible to clinical staff. Both systems continue operating with full functionality — their own databases, their own clinical interfaces, their own tablet stations.

This works because of a fundamental design principle: every Spoke is a complete system. The Hub provides coordination, not capability. When coordination is lost, the only thing that changes is synchronization timing.

Three-Phase Synchronization

When the Ethernet backbone is healthy, the Hub and Spokes synchronize using a three-phase process:

Phase 1 — Verify: The Hub checks each Spoke's health. No response within 30 seconds — skip. A clock-alignment check ensures both devices agree on the current time. If they differ by more than 30 seconds, synchronization is refused to prevent timestamp corruption.
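The clock-alignment rule in Phase 1 reduces to a single comparison. A minimal sketch, assuming each side reports a Unix timestamp during the health handshake (the handshake itself is not shown):

```python
MAX_SKEW_SECONDS = 30  # from the post: >30 s of disagreement blocks sync


def clock_aligned(hub_now: float, spoke_now: float,
                  max_skew: float = MAX_SKEW_SECONDS) -> bool:
    """Refuse synchronization when the two clocks disagree by more than
    max_skew seconds, to prevent timestamp corruption in merged records."""
    return abs(hub_now - spoke_now) <= max_skew


# A 12-second skew is tolerated; a 45-second skew blocks the sync.
print(clock_aligned(1_700_000_000.0, 1_700_000_012.0))  # True
print(clock_aligned(1_700_000_000.0, 1_700_000_045.0))  # False
```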

Phase 2 — Push (Hub to Spokes): Clinical events flow outward: patient records, registrations, prescriptions, vital signs, handoff records. CIRS is the authority for patient data. Separately, the Hub pushes a full CIRS database snapshot to every Spoke every five minutes. These snapshots are stored locally on each Spoke — the twelve most recent copies, a rolling one-hour backup window. Each snapshot ships with a metadata sidecar containing its sha256 hash, schema version, and the Hub's current epoch number.
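Writing the metadata sidecar described above is straightforward. A sketch under assumptions: the sidecar's file naming and JSON field names are invented here, and the schema version string is a placeholder:

```python
import hashlib
import json
from pathlib import Path

SCHEMA_VERSION = "1.0"  # placeholder; the real version string is not documented


def write_sidecar(snapshot: Path, hub_epoch: int) -> Path:
    """Write the sidecar the post describes: the snapshot's sha256 hash,
    the schema version, and the Hub's current epoch, stored alongside it."""
    digest = hashlib.sha256(snapshot.read_bytes()).hexdigest()
    meta = {
        "sha256": digest,
        "schema_version": SCHEMA_VERSION,
        "hub_epoch": hub_epoch,
    }
    sidecar = snapshot.parent / (snapshot.name + ".meta.json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return sidecar
```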

This five-minute snapshot cadence is the foundation of the two capabilities described next.

Phase 3 — Pull (Spokes to Hub): Resource events flow inward: inventory changes, blood bank operations, surgery records, dispensing logs. MIRS is the authority for supply data.

Synchronization is incremental — only changes since the last sync. The snapshots in Phase 2 are the full-database safety net underneath.
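The incremental pull reduces to a filter over modification times. A sketch; the `updated_at` Unix-timestamp field and the record shape are assumptions, since the post only says sync covers "only changes since the last sync":

```python
def changed_since(records: list[dict], last_sync: float) -> list[dict]:
    """Select only records modified after the previous successful sync.
    The 'updated_at' field is an illustrative assumption."""
    return [r for r in records if r["updated_at"] > last_sync]


log = [{"id": 1, "updated_at": 100.0}, {"id": 2, "updated_at": 250.0}]
print([r["id"] for r in changed_since(log, 200.0)])  # [2]
```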

Six Conflict Resolution Strategies

Two devices modify the same record during a disconnection. When they reconnect, which version wins?

The answer depends on what the data is:

Strategy       Data Types                                     Logic
Append both    Vital signs, handoffs, dispensing records      Immutable events — keep both versions
Newest wins    Patient demographics                           Compare timestamps, most recent update prevails
Hub wins       Registrations, prescriptions, surgery records  CIRS (Hub) is authoritative
Sum both       Inventory quantities                           Add both sides' consumption together
Always block   Blood products, controlled substances          Never auto-resolve — require human verification
On-site wins   Equipment status                               The operator physically present takes precedence

Sum both for inventory: The Hub consumed 5 bandages, the Spoke consumed 3. The correct answer is not "whoever updated last" — that would erase one side's consumption. The correct answer is 5 + 3 = 8 consumed.

Always block for blood products: A blood unit marked as "issued" on both stations simultaneously cannot be resolved by any automated rule. Someone needs to physically verify where that blood unit actually is. The system flags the conflict and waits for a human decision.
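The six policies fit in a small dispatch function. This is a sketch of the policies as the post states them, not xGrid's actual resolver; the strategy names and value shapes are illustrative:

```python
def resolve(strategy: str, hub_value, spoke_value):
    """Apply one of the six conflict strategies to a pair of values."""
    if strategy == "append_both":
        return [hub_value, spoke_value]       # immutable events: keep both
    if strategy == "newest_wins":
        return max(hub_value, spoke_value, key=lambda r: r["updated_at"])
    if strategy == "hub_wins":
        return hub_value                      # CIRS (Hub) is authoritative
    if strategy == "sum_both":
        return hub_value + spoke_value        # consumption adds, never overwrites
    if strategy == "on_site_wins":
        return spoke_value                    # the operator on site prevails
    if strategy == "always_block":
        raise RuntimeError("blocked: human verification required")
    raise ValueError(f"unknown strategy: {strategy}")


# The bandage example: 5 consumed on the Hub, 3 on the Spoke.
print(resolve("sum_both", 5, 3))  # 8
```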

Unplug and Go — Spoke Promotes to Hub

This is the single most powerful capability in the architecture.

Picture this: you are running a 1 Hub + 3 Spoke deployment at a mass casualty incident. Two hours in, the incident commander reports a second casualty collection point ten kilometers away. You need a functioning medical station there now.

You walk to one of the Spokes. Unplug its Ethernet cable. Pack it into a bag with a battery and an iPad. Drive to the new site. Plug in power. SSH in and run one command:

sudo xgrid-promote

The promote process is an atomic state machine. It writes a promoting state to the config, then executes each step in sequence: load the latest verified CIRS snapshot, start CIRS, bring up the WiFi hotspot, assign the Hub's static IP, broadcast its presence via mDNS. If any step fails, the entire operation rolls back — the unit returns to Spoke mode with the failure logged. No half-states, no manual cleanup.
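The atomic run-or-roll-back pattern can be sketched as follows. The real promote steps (loading the snapshot, starting CIRS, reconfiguring the network) are reduced here to hypothetical `(run, undo)` callables, and the config shape is invented for illustration:

```python
def promote(config: dict, steps: list) -> dict:
    """Execute each (run, undo) promote step in order. On any failure,
    unwind the completed steps and restore the saved Spoke-mode config."""
    saved = dict(config)                      # state to restore on rollback
    config["state"] = "promoting"
    undos = []
    try:
        for run, undo in steps:
            run()
            undos.append(undo)
        config["role"] = "hub"
        config["epoch"] += 1                  # every promotion bumps the epoch
        config["state"] = "ready"
    except Exception:
        for undo in reversed(undos):          # unwind in reverse order
            undo()
        config.clear()
        config.update(saved)                  # back to Spoke mode; failure logged upstream
    return config
```

Because the rollback path restores the exact pre-promotion config, a failed step leaves no half-state behind, which is the property the post describes.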

When it succeeds, the iPad connects to the new WiFi network, opens the PWA, and you are looking at a fully operational medical station carrying the patient data from the original Hub — at most five minutes old.

The original Hub keeps running with one fewer Spoke. The new Hub runs independently. Two stations, each with its own WiFi coverage, its own CIRS, its own patient intake. When the mission ends, data merges back together.

No pre-planning required. No special hardware. Any Spoke, at any time, can walk away and become a Hub.

Hub Down — Takeover in Five Minutes

Now the harder scenario: the Hub's hardware fails. Power supply burned out, SD card corrupted, or the ceiling collapsed on it.

Every Spoke continuously monitors the Hub's heartbeat — once every 30 seconds. Three consecutive failures (90 seconds of silence) trigger a red banner on the Spoke's PWA: "Hub offline."
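The three-strikes heartbeat rule is a small counter. A sketch, with the class name and method invented for illustration:

```python
HEARTBEAT_INTERVAL = 30  # seconds between probes (from the post)
FAILURE_THRESHOLD = 3    # 3 consecutive misses = 90 s of silence


class HubMonitor:
    """Count consecutive missed heartbeats; any success resets the count."""

    def __init__(self) -> None:
        self.misses = 0

    def record(self, hub_reachable: bool) -> bool:
        """Return True when the 'Hub offline' banner should be raised."""
        self.misses = 0 if hub_reachable else self.misses + 1
        return self.misses >= FAILURE_THRESHOLD
```

Two misses followed by a successful probe raise nothing; only the third consecutive miss trips the banner.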

An operator makes the call: which Spoke takes over? They tap the promote button on the PWA or run xgrid-promote via SSH.

The promote script loads the latest.good snapshot — a symlink that always points to the most recently verified backup. "Verified" means the sha256 hash in the metadata sidecar matches the actual database file, and the schema version is compatible with the current CIRS installation. Corrupted or tampered snapshots never make it into the candidate pool.
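The two verification checks can be sketched directly against the sidecar format. Assumptions: the sidecar field names are invented, and what counts as a "compatible" schema version is reduced here to membership in a set:

```python
import hashlib
import json
from pathlib import Path

COMPATIBLE_SCHEMAS = {"1.0"}  # illustrative; real compatibility rules are not published


def is_verified(snapshot: Path, sidecar: Path) -> bool:
    """A snapshot is 'verified' when its schema version is compatible and
    its sha256 matches the sidecar — the two checks the post names."""
    meta = json.loads(sidecar.read_text())
    if meta.get("schema_version") not in COMPATIBLE_SCHEMAS:
        return False
    actual = hashlib.sha256(snapshot.read_bytes()).hexdigest()
    return actual == meta.get("sha256")
```

Only snapshots passing both checks would be eligible for the `latest.good` symlink; a flipped bit in the database file makes the hash comparison fail and removes it from the candidate pool.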

Once loaded, the new Hub starts CIRS, opens its WiFi hotspot, and begins broadcasting _xgrid-hub._tcp via mDNS. Remaining Spokes discover the new Hub automatically (more on this below) and resume synchronization.

The maximum patient data loss is five minutes — the interval between snapshot pushes. In a high-volume surge, operators can trigger an immediate sync (sync_push.py --now) to bring the RPO even lower.

Why not promote automatically? Because in a disconnected environment, you cannot distinguish "the Hub is destroyed" from "the Ethernet cable is loose." Automatic promotion risks creating two Hubs that both think they are in charge — a split-brain condition. That leads to divergent patient records that are far more dangerous to reconcile than a brief pause in service. Promotion must be a human decision.

Split-Brain Protection — Epochs, Not Trust

"Do not promote automatically" is a policy. Policies can be violated under pressure. xGrid adds a mechanical safeguard: the hub epoch.

Every time a Spoke promotes to Hub, the epoch counter in role.conf increments. Epoch 1 is the original Hub. Epoch 2 is the first promotion. Epoch 3 is the next. The epoch is embedded in every snapshot, every mDNS broadcast, every sync handshake.

This creates three layers of protection:

Zombie detection. Suppose Hub A (epoch 1) crashes and Spoke B promotes (epoch 2). Later, someone plugs Hub A back in and it boots up. On startup, it scans the network and discovers a Hub broadcasting epoch 2 — higher than its own epoch 1. Hub A automatically demotes itself to Spoke. No human intervention needed. The stale Hub surrenders without being asked.

Spoke validation. When a Spoke reconnects, it checks the epoch of every Hub it can see. If it finds two Hubs with different epochs, it does not silently pick one. It raises an orange alert: "Multiple Hubs detected — contact the administrator." The Spoke refuses to auto-reconnect until a human resolves the ambiguity.

Cluster isolation. Every deployment carries a unique cluster_id — a UUID generated when the cluster is first created. Spokes only accept Hubs with a matching cluster_id. A Spoke from your deployment will not accidentally connect to the medical station next door, even if they are on the same network segment.

The epoch system does not prevent split-brain entirely — if two isolated subgroups each promote a Hub with no network between them, you will still have two independent Hubs. But it guarantees that the moment those subgroups reconnect, the conflict is detected and the lower-epoch Hub stands down. The hard problem is not preventing split-brain; it is detecting and resolving it before it causes harm.
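The three protections share one decision function at startup. A sketch; the function name, return values, and the `(cluster_id, epoch)` record shape are all illustrative:

```python
def startup_decision(my_role: str, my_epoch: int, my_cluster: str,
                     seen_hubs: list[tuple[str, int]]) -> str:
    """Decide what a booting node should do, given the Hubs visible on the
    network as (cluster_id, epoch) pairs."""
    # Cluster isolation: Hubs from other deployments are ignored entirely.
    peer_epochs = [e for c, e in seen_hubs if c == my_cluster]
    # Zombie detection: a Hub that sees a higher epoch surrenders.
    if my_role == "hub" and any(e > my_epoch for e in peer_epochs):
        return "demote"
    # Spoke validation: two Hubs with different epochs require a human.
    if my_role == "spoke" and len(set(peer_epochs)) > 1:
        return "alert_operator"
    return "proceed"
```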

Service Discovery — No Hardcoded IPs

When a new Hub comes online after promotion, the remaining Spokes need to find it. In traditional systems, you would update a configuration file on every Spoke with the new Hub's IP address. In a disaster, that is unacceptable overhead.

xGrid uses mDNS (multicast DNS) via avahi-daemon. When a Hub starts — whether it is the original or a newly promoted one — it broadcasts _xgrid-hub._tcp on the local network, advertising its IP, its hub epoch, its cluster ID, and its station name.

Spokes listen for this broadcast. When they detect a new Hub with a valid cluster ID and a higher epoch than the last known Hub, they update their connection target automatically. The PWA follows the same logic: if the API endpoint goes unreachable and a new Hub appears on mDNS, the PWA switches its API base URL and shows a green toast: "Connected to {station_name}@epoch{N}."
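The Spoke-side selection rule (matching cluster ID, higher epoch than the last known Hub) is pure logic once discovery hands over the records. A sketch with an invented record shape; the actual mDNS parsing via avahi is not shown:

```python
def pick_hub(records: list[dict], my_cluster: str, last_epoch: int):
    """From discovered Hub advertisements, return the one a Spoke should
    switch to, or None if no valid, newer Hub is visible."""
    candidates = [
        r for r in records
        if r["cluster_id"] == my_cluster and r["epoch"] > last_epoch
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["epoch"])  # prefer the newest epoch
```

A neighboring deployment's Hub never matches, no matter how high its epoch, because the cluster ID filter runs first.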

No configuration files to update. No IP addresses to remember. The network is self-describing.

This only works within the same Layer 2 broadcast domain — which is exactly what an Ethernet switch provides. Every RPi connected to the same switch can discover every other RPi. If you are using a managed switch with VLANs, additional configuration is needed — but in a field deployment, an unmanaged switch is the norm.

Headless — Your iPad Is the Control Panel

xGrid's Raspberry Pis never have monitors. Never have keyboards. No HDMI cable, no USB peripherals. The USB ports stay empty.

All clinical work happens through PWAs over WiFi. Each RPi runs its own WiFi hotspot, covering its physical area. A nurse walks into a triage tent, connects to that tent's WiFi, opens the PWA in Safari, and starts working. Move to the next tent — connect to the next hotspot.

System administration happens over SSH. A laptop on the Hub's WiFi network can manage the entire topology: check station status, trigger sync, promote or demote nodes, inspect logs. One person with a laptop can administer five stations from a folding chair.

The total hardware per station: one Raspberry Pi, one power cable, one Ethernet cable (if not standalone). That is it. No monitor stand, no keyboard tray, no desk. The deployment footprint is small enough to fit in a backpack.

Station Consolidation — When a Site Evacuates

When a station must evacuate, its data needs to merge into a surviving station. Four merge modes handle different scenarios:

  • Full merge: All data flows into the target station
  • Partial merge: Only selected resource categories transfer
  • Backup import: Restore from a portable backup
  • Emergency close: Station shutdown with complete data preservation

If the evacuating station was a promoted Spoke (now acting as an independent Hub), it first demotes back — stopping CIRS, shutting down its WiFi hotspot, switching back to DHCP — then reconnects to the surviving Hub's switch to begin the merge.

Every merge records exactly what moved: how many inventory items, blood products, equipment records, and surgery records transferred. This audit trail answers the post-disaster question that always arises: "When that station evacuated, where did everything go?"

Designing for Disconnection

Most distributed systems start from the premise "the network is reliable" and add exception handling for when it is not.

xGrid starts from the premise "the network is unreliable" and optimizes for when it happens to work.

This inversion produces fundamentally different design:

  • Every node is a complete system (not a thin client that depends on a server)
  • Every unit ships with all software pre-installed (the role is a config file, not a hardware variant)
  • Any Spoke can promote to Hub (there is no "special Hub machine")
  • Synchronization is periodic batch (not real-time streaming)
  • Conflict resolution is default behavior (not exception handling)
  • Human intervention is the correct answer for some conflicts (not a bug to eliminate)
  • Promotion is a human decision (not an automated reaction — because split-brain is more dangerous than a brief pause)
  • Stale Hubs demote themselves (because the epoch counter is a fact, not a policy)

The cable will get kicked. The switch will get knocked off the table. The Hub will take a hit.

None of these are failure modes. They are topology transitions.


Related: Offline-First Is Not a Fallback · ISBAR Is More Than a Handoff Format