Chat stream keeps disconnecting
Diagnose dropped WebSocket streams from the chat UI or your own SDK consumer.
The chat panel shows "reconnecting" every minute, or your custom SDK consumer drops without an error. Find the cause.
The goal
A stable WebSocket that survives long turns and ordinary network blips.
Steps
Check the proxy timeout.
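Most "60-second drop" issues come from a reverse proxy with a 60-second idle timeout (the Nginx default for `proxy_read_timeout`). Raise it to at least 5 minutes. The example below is a Caddyfile; on Nginx, raise `proxy_read_timeout` instead: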
```caddyfile
platos.example.com {
	reverse_proxy webapp:3030 {
		transport http {
			dial_timeout 30s
			response_header_timeout 5m
			read_timeout 5m
			write_timeout 5m
		}
	}
}
```

Force WebSocket transport.
Some networks force a long-polling fallback. The SDK respects a `transports` option:

```typescript
const stream = await platos.threads.stream({ threadId, transports: ["websocket"] });
```

In the chat UI, append `?transport=websocket` to the URL.

Check the early-message buffer.
The race-fix invariant in `tool-sync-ws.service.ts:130-135, 281-284` buffers early messages during the auth handshake. A custom client that bypasses it can drop the first frames after a reconnect. If you wrote your own client, mirror the buffer.

Check the SDK reconnect backoff.
`@platosdev/client` reconnects with exponential backoff (1s, 2s, 4s, ..., max 30s). After 30 seconds disconnected, it re-fetches message history instead of replaying missed messages. If you see "lost" messages after a long disconnect, this is why; bridge the gap with `messages.list`.

Check the audit log.
`monitoring/admin-audit` shows `auth.session_token.expired` events. Session tokens expire after 5 minutes, so long streams need refresh pings. The SDK sends them automatically; raw WebSocket clients must implement them.
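If you wrote your own client, two of the steps above become client-side obligations: hold frames that arrive before the auth handshake completes, and refresh the 5-minute session token before it expires. The sketch below shows both under stated assumptions: your client can observe an auth acknowledgement and knows when its token was issued. `EarlyFrameBuffer`, `refreshDelayMs`, the `Frame` shape, and the one-minute refresh lead are hypothetical, not the Platos wire protocol or the code in `tool-sync-ws.service.ts`.

```typescript
// Hypothetical sketch for a raw WebSocket client. The frame shape, names, and
// the refresh lead time are assumptions, not the Platos protocol.

interface Frame {
  seq: number;
  payload: string;
}

/** Holds inbound frames until the auth handshake is acknowledged. */
class EarlyFrameBuffer {
  private pending: Frame[] = [];
  private authenticated = false;

  constructor(private readonly deliver: (frame: Frame) => void) {}

  /** Route every inbound frame here, including frames that arrive mid-handshake. */
  receive(frame: Frame): void {
    if (this.authenticated) {
      this.deliver(frame);
    } else {
      this.pending.push(frame);
    }
  }

  /** Call when the server acknowledges auth: flushes queued frames in arrival order. */
  markAuthenticated(): void {
    this.authenticated = true;
    const queued = this.pending;
    this.pending = [];
    for (const frame of queued) {
      this.deliver(frame);
    }
  }

  /** Call on disconnect so the next handshake buffers again; discards frames from the dead socket. */
  reset(): void {
    this.authenticated = false;
    this.pending = [];
  }
}

// Session tokens expire after 5 minutes (per this guide).
const SESSION_TOKEN_TTL_MS = 5 * 60 * 1000;
// Assumed safety margin: refresh one minute before expiry.
const REFRESH_LEAD_MS = 60 * 1000;

/** Milliseconds to wait before sending the next refresh ping (0 if already due). */
function refreshDelayMs(
  issuedAtMs: number,
  nowMs: number,
  ttlMs: number = SESSION_TOKEN_TTL_MS,
  leadMs: number = REFRESH_LEAD_MS,
): number {
  const refreshAtMs = issuedAtMs + ttlMs - leadMs;
  return Math.max(0, refreshAtMs - nowMs);
}
```

In a real client you would pass `refreshDelayMs(issuedAt, Date.now())` to `setTimeout` and send whatever refresh message your server expects; that message format is not described here.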
Verify
- Long turns (>1 minute) complete without UI flicker.
- The browser's network tab shows the WebSocket as `Upgraded` and persistent.
- A simulated network drop reconnects within 5 seconds.
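The SDK reconnect schedule from the steps (1s, 2s, 4s, ..., max 30s, then a history re-fetch after 30 seconds disconnected) determines how the 5-second check behaves. A minimal model of that schedule, assuming no jitter and an instantaneous connect attempt; the function names are hypothetical and the real `@platosdev/client` implementation may differ:

```typescript
// Hypothetical model of the reconnect schedule described in this guide.
// Assumes no jitter and that each connect attempt succeeds instantly once
// the network is back.

const BASE_DELAY_MS = 1000; // first retry after 1s
const MAX_DELAY_MS = 30000; // delays are capped at 30s
const REFETCH_AFTER_MS = 30000; // past this, re-fetch history instead of replaying

/** Delay before retry `attempt` (0-based): 1s, 2s, 4s, 8s, 16s, then 30s. */
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

/** Time from the drop until the first retry that fires after the outage ends. */
function reconnectAtMs(outageMs: number): number {
  let elapsedMs = 0;
  let attempt = 0;
  while (true) {
    elapsedMs += backoffDelayMs(attempt);
    attempt += 1;
    if (elapsedMs >= outageMs) {
      return elapsedMs;
    }
  }
}

/** How the client recovers after being disconnected for `disconnectedMs`. */
function recoveryMode(disconnectedMs: number): "replay" | "refetch" {
  // Treating exactly 30s as "replay" is an assumption about the boundary.
  return disconnectedMs > REFETCH_AFTER_MS ? "refetch" : "replay";
}
```

Under this model, retries fire 1, 3, 7, 15, and 31 seconds after the drop. A simulated outage of up to 3 seconds reconnects within the 5-second target, while a 3.5-second outage waits for the 7-second retry. A 20-second outage reconnects at 31 seconds, past the threshold, so the client re-fetches history.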
Common findings
- Cloudflare or a similar proxy in front, with its default 100-second timeout. Configure WebSocket-friendly timeouts.
- A corporate proxy that strips `Upgrade` headers; the transport falls back to long-polling, which adds latency.
- A misbehaving load balancer health check that closes the socket on each health probe.
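To separate the timeout findings from other drops, log the close code and how long the socket was idle before each disconnect. A minimal triage sketch, assuming your client tracks the time since the last frame itself; `suspectCause`, its labels, and the 2-second tolerance are hypothetical. Codes 1000 (normal closure) and 1006 (closed without a close frame) are standard WebSocket close codes. This check does not detect the health-probe or stripped-`Upgrade` cases.

```typescript
// Hypothetical triage helper. The timeout values are the ones named in this
// guide; the labels and tolerance are assumptions.

interface CloseObservation {
  code: number; // CloseEvent.code from the WebSocket close event
  msSinceLastFrame: number; // idle time before the socket closed, tracked by your client
}

const KNOWN_IDLE_TIMEOUTS: ReadonlyArray<{ ms: number; suspect: string }> = [
  { ms: 60000, suspect: "reverse proxy idle timeout (Nginx default 60s)" },
  { ms: 100000, suspect: "Cloudflare or similar proxy (default 100s)" },
];

function suspectCause(obs: CloseObservation, toleranceMs: number = 2000): string {
  // 1000 is the standard normal-closure code: one side closed on purpose.
  if (obs.code === 1000) {
    return "normal closure";
  }
  // An idle time close to a known proxy timeout points at that proxy.
  const match = KNOWN_IDLE_TIMEOUTS.find(
    (t) => Math.abs(obs.msSinceLastFrame - t.ms) <= toleranceMs,
  );
  if (match) {
    return match.suspect;
  }
  // 1006 means the connection closed without a close frame (e.g. a dropped TCP connection).
  if (obs.code === 1006) {
    return "abnormal closure: network drop or intermediary";
  }
  return `unclassified close code ${obs.code}`;
}
```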
Next steps
- Trace a single turn to confirm the turn itself completed engine-side.
- Connect an entity (TypeScript) for the entity-side equivalent of the same patterns.
