What We’re Testing
The Health page subscribes to real-time machine updates using the api.channel() API from src/lib/api/client.ts. The subscription is set up in HealthPage.tsx:
const channel = api
.channel("health-machines-rt")
.on("postgres_changes", { event: "*", schema: "public", table: "machines",
filter: `org_id=eq.${currentOrg.id}` }, () => fetch())
.subscribe();
When a WebSocket event arrives on the machines channel, the callback re-runs the full machine list fetch from GET /api/db/machines. The page does not attempt to apply incremental patches — it always re-fetches everything.
The WebSocket connection is established by RealtimeChannel.subscribe() in the client:
- Opens wss://login.quickztna.com/api/realtime?org_id=<org_id>&token=<jwt>
- Sends { "type": "subscribe", "channels": ["machines"] } on open
- Server (ws-manager.ts) sends { "type": "subscribed", "channels": ["machines"] } in response
- Events arrive as { "type": "event", "channel": "machines", "event": "UPDATE"|"DELETE", "payload": {...} }
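As a rough illustration of the handshake above, the sketch below builds the subscribe frame and classifies incoming frames. The type and function names (ServerMessage, buildSubscribeFrame, parseServerMessage, shouldRefetch) are hypothetical and are not taken from src/lib/api/client.ts; only the JSON shapes come from the description above.

```typescript
// Hypothetical names throughout; only the JSON shapes come from the text above.
type SubscribedMessage = { type: "subscribed"; channels: string[] };
type EventMessage = {
  type: "event";
  channel: string;
  event: "UPDATE" | "DELETE";
  payload: Record<string, unknown>;
};
type ServerMessage = SubscribedMessage | EventMessage;

// Frame the client sends once the socket opens.
function buildSubscribeFrame(channels: string[]): string {
  return JSON.stringify({ type: "subscribe", channels });
}

// Returns null for malformed frames and for message types not modelled here.
function parseServerMessage(raw: string): ServerMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const fields = data as Record<string, unknown>;
  const kind = fields["type"];
  const channels = fields["channels"];
  const channel = fields["channel"];
  const eventName = fields["event"];
  const payload = fields["payload"];

  if (kind === "subscribed" && Array.isArray(channels)) {
    return { type: "subscribed", channels: channels.map((c) => String(c)) };
  }
  if (
    kind === "event" &&
    typeof channel === "string" &&
    (eventName === "UPDATE" || eventName === "DELETE") &&
    typeof payload === "object" &&
    payload !== null
  ) {
    return {
      type: "event",
      channel,
      event: eventName as "UPDATE" | "DELETE",
      payload: payload as Record<string, unknown>,
    };
  }
  return null;
}

// Any event on the machines channel triggers the full re-fetch; nothing is patched in place.
function shouldRefetch(msg: ServerMessage | null): boolean {
  return msg !== null && msg.type === "event" && msg.channel === "machines";
}
```

Because the page always re-fetches the whole list, the payload content does not need to be inspected on the client; only the channel name matters for deciding whether to refresh.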
On the server side, events are published to the WebSocket channel by two handlers:
- machine-heartbeat.ts calls broadcastEvent(env, org_id, "machines", "UPDATE", {id, status, last_seen}), but only when status changes, not on every heartbeat
- cleanup-machines.ts calls broadcastEvent(env, org_id, "machines", "UPDATE", {id, status: "offline"}) for each machine it marks offline, and broadcastEvent(env, org_id, "machines", "DELETE", {id}) for deleted ephemeral machines
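The "only when status changes" rule in the heartbeat handler can be pictured with a small sketch. This is an assumption about the shape of the check, not the contents of machine-heartbeat.ts; publishIfStatusChanged, BroadcastFn and HeartbeatUpdate are hypothetical names, and the broadcast function is passed in so the example stays self-contained.

```typescript
// Hypothetical sketch; the real handler is machine-heartbeat.ts, whose body is not shown here.
type BroadcastFn = (
  channel: string,
  event: "UPDATE" | "DELETE",
  payload: Record<string, unknown>
) => void;

interface HeartbeatUpdate {
  id: string;
  status: string;
  last_seen: string;
}

// Publish an UPDATE only when the stored status differs from the new one.
// Returns true if an event was published.
function publishIfStatusChanged(
  previousStatus: string,
  update: HeartbeatUpdate,
  broadcast: BroadcastFn
): boolean {
  if (previousStatus === update.status) return false; // routine heartbeat, stay silent
  broadcast("machines", "UPDATE", {
    id: update.id,
    status: update.status,
    last_seen: update.last_seen,
  });
  return true;
}
```

Under this rule an offline-to-online heartbeat produces one event, while later heartbeats from the same online machine produce none, which is why ST2 expects a single UPDATE frame.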
The Valkey pub/sub layer (ws:org:<org_id> channel) ensures cross-server delivery between the production and standby nodes.
WebSocket fallback: If the WebSocket connection fails (network error or onerror event), the client falls back to 30-second interval polling, calling all registered callbacks on each tick.
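The fallback behaviour can be sketched as a small timer wrapper. This is a minimal illustration under stated assumptions, not the RealtimeChannel implementation: the class name PollingFallback and its methods are hypothetical, and only the 30-second interval and the "call every registered callback on each tick" behaviour come from the description above.

```typescript
// Hypothetical sketch of the polling fallback; not the actual RealtimeChannel code.
type RefetchCallback = () => void;

class PollingFallback {
  private readonly callbacks: RefetchCallback[] = [];
  private readonly intervalMs: number;
  private timer: ReturnType<typeof setInterval> | null = null;

  constructor(intervalMs: number = 30_000) {
    this.intervalMs = intervalMs;
  }

  register(cb: RefetchCallback): void {
    this.callbacks.push(cb);
  }

  // Intended to be called from the socket's error handling path.
  start(): void {
    if (this.timer !== null) return; // already polling, do not stack timers
    this.timer = setInterval(() => this.tick(), this.intervalMs);
  }

  // One poll cycle: invoke every registered callback.
  tick(): void {
    for (const cb of this.callbacks) cb();
  }

  // Intended to be called when the WebSocket is back or the channel is disposed.
  stop(): void {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }

  isPolling(): boolean {
    return this.timer !== null;
  }
}
```

Guarding start() against a second timer matters because error and close handlers can both fire for the same failure; without the guard the page would re-fetch twice per interval.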
Heartbeat timeout: The server-side ws-manager.ts evicts clients that have not sent a ping message within 90 seconds (HEARTBEAT_TIMEOUT = 90_000). The server checks every 30 seconds (HEARTBEAT_INTERVAL = 30_000). Clients that are evicted receive close code 4000.
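A sketch of what one server-side sweep might look like follows. Only HEARTBEAT_TIMEOUT, the lastPing field and the pong shape are taken from this page; TrackedClient, findStaleClients and handlePing are hypothetical names, and the strict "greater than" comparison is an assumption about ws-manager.ts.

```typescript
// Constant as quoted above; everything else is a hypothetical sketch of ws-manager.ts.
const HEARTBEAT_TIMEOUT = 90_000; // ms without a ping before eviction

interface TrackedClient {
  id: string;
  lastPing: number; // epoch ms; initialised at connection open
}

// One sweep: ids of clients whose last ping is older than the timeout.
// Assumes a strict "greater than" comparison.
function findStaleClients(clients: TrackedClient[], now: number): string[] {
  return clients
    .filter((c) => now - c.lastPing > HEARTBEAT_TIMEOUT)
    .map((c) => c.id);
}

// A ping refreshes the timestamp and is answered with a pong.
function handlePing(
  client: TrackedClient,
  now: number
): { type: "pong"; timestamp: string } {
  client.lastPing = now;
  return { type: "pong", timestamp: new Date(now).toISOString() };
}
```

Clients returned by findStaleClients would then be closed with code 4000; a client that pings at least once every 90 seconds never appears in that list.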
Your Test Setup
| Machine | Role |
|---|---|
| ⊞ Win-A | Browser — Health page open, observing real-time updates |
| 🐧 Linux-C | VPN source — start/stop to trigger status events |
ST1 — WebSocket Connects and Subscribes to Machines Channel
What it verifies: Opening the Health page establishes a WebSocket connection and successfully subscribes to the machines channel.
Steps:
- On ⊞ Win-A, open browser DevTools (F12) and go to the Network tab. Filter by WS (WebSocket).
- Navigate to /health (or reload if already there).
- In DevTools, find the WebSocket connection to wss://login.quickztna.com/api/realtime.
- Click the connection entry and go to the Messages tab.
- You should see the outgoing subscribe message sent immediately after the connection opens:
{"type":"subscribe","channels":["machines"]}
- You should receive the server’s confirmation:
{"type":"subscribed","channels":["machines"]}
- You should also have received a connected welcome message shortly before the subscribe:
{
"type": "connected",
"clientId": "uuid",
"availableChannels": ["machines","members","audit","keys","general","posture","threats","certificates","dns","acl","billing","remote_desktop"]
}
Pass: WebSocket connects, subscribe message is sent, subscribed confirmation is received.
Fail / Common issues:
- No WebSocket connection visible: the channel name health-machines-rt is used as an internal key only; the actual WS URL is constructed from currentOrg.id. If currentOrg is not set, the channel falls back to 30-second polling (no WS connection is made). Verify org context is active.
- 401 response on upgrade: the JWT token may be expired. Log out and back in to refresh the session.
- Connection immediately closes: check that the org_id belongs to an org the user is a member of. The server’s isOrgMember() check will close the connection if not.
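The first failure mode in ST1 is easier to reason about with the URL construction spelled out. The helper below is a hypothetical sketch, not the code in client.ts: buildRealtimeUrl is an invented name, and URL-encoding the parameters is an assumption. The base URL and the org_id and token parameters are the ones listed earlier on this page.

```typescript
// Hypothetical helper; base URL and query parameters as listed earlier on this page.
const REALTIME_BASE = "wss://login.quickztna.com/api/realtime";

// Returns the WebSocket URL, or null when there is no active org,
// in which case the channel would go straight to polling.
function buildRealtimeUrl(orgId: string | undefined, token: string): string | null {
  if (!orgId) return null;
  return (
    `${REALTIME_BASE}?org_id=${encodeURIComponent(orgId)}` +
    `&token=${encodeURIComponent(token)}`
  );
}
```

A null result corresponds to the case where no entry appears under the WS filter in DevTools at all, which distinguishes it from a rejected upgrade or an immediate close.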
ST2 — Machine Goes Online: UPDATE Event Triggers Re-fetch
What it verifies: When a machine starts up and sends its first heartbeat (causing a status change from offline to online), the server broadcasts an UPDATE event and the Health page re-fetches and updates.
Steps:
- On ⊞ Win-A, open the Health page. Keep DevTools open on the WebSocket Messages tab.
- Ensure 🐧 Linux-C is currently offline:
curl -s "https://login.quickztna.com/api/db/machines?org_id=$ORG_ID&select=name,status" \
-H "Authorization: Bearer $TOKEN" | python3 -m json.tool
- On 🐧 Linux-C, start the VPN:
ztna up
- Within 5-10 seconds, watch the Health page on ⊞ Win-A. The Linux-C row should update from grey (offline) to green (online) without a manual page refresh.
- In DevTools WebSocket Messages, confirm you received an event similar to:
{
"type": "event",
"channel": "machines",
"event": "UPDATE",
"payload": {
"id": "machine-uuid",
"status": "online",
"last_seen": "2026-03-17T10:30:45.123Z"
},
"timestamp": "2026-03-17T10:30:45.200Z"
}
Note: This event is only broadcast when status changes. Routine heartbeats from an already-online machine do not produce a WebSocket event.
Pass: Status change from offline to online triggers a WebSocket UPDATE event. Health page updates within seconds without a manual refresh.
Fail / Common issues:
- Page does not update automatically: WebSocket may have disconnected. Check DevTools for the WS connection state. If disconnected, the page falls back to 30-second polling; the update will appear after the next poll cycle.
- Event is received but page still shows old data: the re-fetch callback may have hit a transient network error. Check the Network tab for a failed /api/db/machines request after the WS event.
ST3 — Machine Goes Offline: UPDATE Event from Cleanup Job
What it verifies: When the cleanup cron job marks a machine offline (after 3 minutes of missed heartbeats), it broadcasts an UPDATE event that triggers a Health page re-fetch.
Steps:
- On ⊞ Win-A, keep the Health page open with DevTools WebSocket Messages visible.
- On 🐧 Linux-C, kill the VPN process without graceful shutdown:
sudo pkill -9 ztna
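- Wait at least 3 minutes: a machine only becomes eligible once its last_seen is more than 3 minutes old, and the job’s own schedule adds further delay. The cleanup job (cleanup-machines.ts) runs periodically and executes: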
UPDATE machines SET status = 'offline'
WHERE status = 'online' AND last_seen < NOW() - INTERVAL '3 minutes'
After the job runs, it calls broadcastEvent(env, org_id, "machines", "UPDATE", {id, status: "offline"}) for each machine it marks offline.
- Watch the Health page on ⊞ Win-A. The Linux-C row should transition to grey/offline automatically.
- In DevTools, confirm the UPDATE event:
{
"type": "event",
"channel": "machines",
"event": "UPDATE",
"payload": {
"id": "machine-uuid",
"status": "offline"
}
}
Pass: After ~3 minutes of missed heartbeats, cleanup job broadcasts UPDATE event, Health page shows offline status automatically.
Fail / Common issues:
- Page takes longer than 3 minutes to update: the cleanup cron job’s run frequency determines the lag. The job is scheduled in backend/src/cron.ts; check its interval if updates are delayed.
- The graceful ztna down command sends an offline heartbeat synchronously, so status changes immediately (via broadcastEvent in machine-heartbeat.ts). The 3-minute delay only applies to hard kills.
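To estimate how long ST3 can take, the staleness rule from the cleanup SQL can be mirrored in a few lines. This is a sketch with hypothetical names (MachineRow, machinesToMarkOffline, worstCaseOfflineDelayMs); the 3-minute threshold comes from the query above, while the cron interval is left as a parameter because its actual value lives in backend/src/cron.ts and is not stated here.

```typescript
// Mirrors the WHERE clause of the cleanup SQL; names are hypothetical.
const OFFLINE_AFTER_MS = 3 * 60 * 1000;

interface MachineRow {
  id: string;
  status: string;
  last_seen: number; // epoch ms
}

// Ids the cleanup job would mark offline if it ran at time `now`.
function machinesToMarkOffline(rows: MachineRow[], now: number): string[] {
  return rows
    .filter((r) => r.status === "online" && r.last_seen < now - OFFLINE_AFTER_MS)
    .map((r) => r.id);
}

// Upper bound on the gap between the last heartbeat and the offline event,
// ignoring the job's own run time. The cron interval is an input, not a known value.
function worstCaseOfflineDelayMs(cronIntervalMs: number): number {
  return OFFLINE_AFTER_MS + cronIntervalMs;
}
```

For example, if the job happened to run once a minute, worstCaseOfflineDelayMs(60_000) gives 240000 ms, so the row could take up to four minutes to turn grey after a hard kill.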
ST4 — WebSocket Fallback to 30-Second Polling
What it verifies: If the WebSocket connection is unavailable, the client falls back to periodic polling and the Health page still reflects machine changes (with up to 30-second delay).
Steps:
- On ⊞ Win-A, open DevTools. Go to the Network tab, select Throttling, and choose Offline to simulate a network interruption, or use the Application tab to block the WebSocket URL.
  Alternatively, use browser DevTools to simulate a WebSocket failure:
  - Open the Console tab
  - The RealtimeChannel class handles onerror and onclose by starting a 30-second polling interval
- Wait for the WebSocket close event. After that the client starts polling every 30 seconds (the _startPolling() method fires all callbacks on each tick).
- Restore normal network. On 🐧 Linux-C, toggle the VPN:
ztna down
sleep 5
ztna up
- On ⊞ Win-A, observe the Health page. Without WebSocket, the status change will appear within 30 seconds (next poll cycle), not immediately.
- After the network is restored, the client will attempt WebSocket reconnection after 5 seconds (_reconnectTimer = setTimeout(..., 5000)).
Expected: Status changes are reflected within 30 seconds when operating in polling fallback mode.
Pass: Page updates machine status during polling fallback mode, within one 30-second poll interval.
Fail / Common issues:
- Page never updates without WebSocket: verify the polling timer is active. If the channel was disposed (removeChannel called) before the WebSocket error, polling will not start.
- Reconnection does not occur: the 5-second reconnect timer requires the browser to remain on the /health page. Navigation away disposes the channel.
ST5 — Server-Side WebSocket Heartbeat and Eviction
What it verifies: The server evicts WebSocket clients that have not sent a ping within 90 seconds, closing the connection with code 4000.
Steps:
- On ⊞ Win-A, open the Health page with DevTools WebSocket Messages visible.
- The browser client in RealtimeChannel does not automatically send ping frames; it only sends subscribe/unsubscribe/broadcast messages. The server’s 90-second timeout (HEARTBEAT_TIMEOUT = 90_000) measures the time since lastPing was set, which initialises to Date.now() at connection open. This means if the client never sends a ping, it will be evicted after 90 seconds.
- To manually test the ping/pong exchange, use the browser console to send a ping to the active WebSocket:
// Find the active WebSocket — this is internal to the client module
// but you can observe it via DevTools
// In DevTools Network → WS → Messages, manually type a message
// (some browsers support this in the "Send" box)
{"type":"ping"}
- The server should respond with:
{"type":"pong","timestamp":"2026-03-17T10:31:00.000Z"}
- To observe the eviction: open the Health page, wait without any user interaction, and watch DevTools for a WebSocket close event with code 4000 and reason Heartbeat timeout. Expect the close between 90 and 120 seconds after the connection opens, because the server only checks every 30 seconds.
- After eviction, the client’s onclose handler fires and schedules a reconnection after 5 seconds.
Expected: Server sends pong in response to ping. Clients that go silent for 90 seconds receive close code 4000. The client reconnects automatically within 5 seconds.
Pass: ping receives pong. After 90 seconds of silence, connection closes with code 4000. Reconnection happens within 5 seconds.
Fail / Common issues:
- No eviction observed after 90 seconds: the server heartbeat check runs every 30 seconds (HEARTBEAT_INTERVAL = 30_000), so eviction can take up to 30 seconds beyond the 90-second timeout threshold (i.e., up to 120 seconds total).
- Close code is not 4000: normal TCP-level disconnections use code 1001 or 1006. Code 4000 specifically means the server’s heartbeat check evicted the client.
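The 90-to-120-second window can be checked with a small simulation of the periodic sweep. This is a sketch under assumptions: evictionTimeMs is a hypothetical function, the sweep is modelled as firing at a fixed cadence, and a client is treated as stale only when its silence strictly exceeds the timeout.

```typescript
// Hypothetical simulation of the server's periodic check; assumes a strict
// "older than timeout" comparison and sweeps at a fixed cadence.
function evictionTimeMs(
  openedAt: number,
  firstSweepAt: number,
  timeoutMs: number = 90_000,
  sweepEveryMs: number = 30_000
): number {
  let sweep = firstSweepAt;
  // Step sweep by sweep until the client has been silent longer than the timeout.
  while (sweep - openedAt <= timeoutMs) {
    sweep += sweepEveryMs;
  }
  return sweep;
}
```

With a connection opened at time 0, a first sweep at 1 ms yields an eviction at 90001 ms, while a first sweep at 30000 ms yields 120000 ms, because the sweep at exactly 90000 ms does not yet count as a timeout. Where a given connection falls in that range depends only on how its open time lines up with the sweep schedule.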
Summary
| Sub-test | What it proves | Pass condition |
|---|---|---|
| ST1 | WebSocket connects and subscribes | connected welcome + subscribed confirmation received in DevTools |
| ST2 | Online transition triggers UPDATE event | ztna up → WebSocket UPDATE event → page updates without refresh |
| ST3 | Offline detection via cleanup job | Hard kill → 3-min cleanup job → UPDATE event → page shows offline |
| ST4 | Polling fallback | WebSocket failure → 30-sec poll interval → updates within 30 sec |
| ST5 | Server-side heartbeat eviction | Idle client evicted after 90 sec with close code 4000; reconnects in 5 sec |