Streaming Terminals into the Browser from Agent Sandboxes

We run coding agents in cloud sandboxes. To make a sandbox feel like a real development environment, we need terminals in the browser. Getting a terminal into a browser is straightforward, but making it feel like local development over variable connections, tab reloads, and multiple users is a harder problem. Existing tools don’t quite fit the bill: xterm.js streams raw PTY bytes to the client and reconstructs state there, so every new connection replays history and there's no single canonical screen. tmux and Mosh keep canonical state on the server, but neither speaks a protocol browsers can use. Blit combines both approaches: the server runs Alacritty and owns the screen state, then streams compressed frame diffs to any connected client over any transport. The rest of this post covers how that works: how we stream frames efficiently, how the browser renders them, and how per-client congestion control adapts delivery to the connection at hand.

Indent · Apr 9, 2026 · 4 min read
One server to rule them all

Blit moves the terminal emulator to the server. Each PTY gets its own instance of Alacritty, which parses raw output into a structured cell grid. The server diffs that grid against what each client last saw and sends compressed deltas. Clients just apply the diff and render. A new connection doesn’t replay history, it receives the current frame.

The round-trip looks like this: a keypress travels from client to server, the server updates the grid, and a diff streams back. Because diffs are per-client, the server can send each one at a rate matched to the connection. Because the server owns the full screen state, reconnection is instant: there’s no event log to replay, just the current frame.

This is the same insight behind tmux and Mosh: the server is the source of truth, and clients are views into it. A browser tab, a server-side agent calling blit show, and a CLI session all read the same data structure. The next question is how to make it performant. If the server has to snapshot the terminal, diff it against what each client last saw, compress the delta, and ship it over the network before a keystroke becomes visible, how do you keep that fast enough that typing doesn’t feel laggy?

Blitting at the speed of light

For typing to feel responsive, the server needs to get screen updates to the browser as fast as possible. That means the diffs need to be small and cheap to produce. Blit gets there by treating the terminal as a flat array of fixed-size cells and diffing it at the byte level.

Each cell is a 12-byte struct: the character, its foreground and background colors, and style flags like bold or underline. Because every cell is the same size, comparing two frames is just walking two byte arrays and marking which 12-byte slices differ. A bitmask tracks the dirty cells, one bit per cell, and the payload includes only the ones that changed. On a typical 80x24 terminal, a single keystroke touches one or two cells out of nearly two thousand.
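As a sketch (the cell layout is Blit's, but the function and names here are illustrative), the byte-level diff is just a walk over two flat arrays, setting one bit per changed cell:

```rust
const CELL_SIZE: usize = 12;

/// Compare two frames (flat byte arrays of equal length) and return a
/// dirty bitmask with one bit per 12-byte cell.
fn diff_frames(prev: &[u8], next: &[u8]) -> Vec<u8> {
    assert_eq!(prev.len(), next.len());
    let cells = prev.len() / CELL_SIZE;
    let mut mask = vec![0u8; (cells + 7) / 8];
    for i in 0..cells {
        let a = &prev[i * CELL_SIZE..(i + 1) * CELL_SIZE];
        let b = &next[i * CELL_SIZE..(i + 1) * CELL_SIZE];
        if a != b {
            mask[i / 8] |= 1 << (i % 8);
        }
    }
    mask
}
```

The slice comparison compiles down to a memcmp over 12 bytes, so the whole diff is a single sequential pass with no per-cell branching on fields.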

LZ4 compresses better when adjacent bytes in the stream have similar values. To exploit that, the server doesn’t write dirty cells in their natural order. It writes all their first bytes, then all their second bytes, and so on through all twelve positions. Flag bytes land next to flag bytes, red channels next to red channels. A one-character update compresses to well under 100 bytes.
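The reordering itself is a small loop. This sketch assumes the 12-byte cell from above and shows the planar write order that the compressor sees:

```rust
/// Transpose dirty cells into planar order: byte 0 of every cell,
/// then byte 1 of every cell, and so on through all twelve positions.
fn transpose_cells(cells: &[[u8; 12]]) -> Vec<u8> {
    let mut out = Vec::with_capacity(cells.len() * 12);
    for byte_pos in 0..12 {
        for cell in cells {
            out.push(cell[byte_pos]);
        }
    }
    out
}
```

After this pass, every style-flag byte sits next to another style-flag byte, which is exactly the kind of low-entropy run LZ4 is good at collapsing.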

Scrolling is the most common large change, but most of the screen doesn’t actually change; the rows just shift. Rather than retransmitting all of them, the server detects the vertical offset, sends a 13-byte copy instruction to shift the rows, and only patches what’s new. The newly revealed lines are often blank, and when every cell in a row is identical the server sends a single fill instruction instead of the cells individually. A cleared 80-column row goes from 960 bytes to 21.
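The row-level decision can be sketched with an invented op enum standing in for Blit's actual wire format (the variant names and fields here are assumptions):

```rust
#[allow(dead_code)]
enum RowOp {
    /// Shift `count` rows from `src` to `dst` (the scroll case).
    Copy { src: u16, dst: u16, count: u16 },
    /// Every cell in the row is identical: send the cell once.
    Fill { row: u16, cell: [u8; 12] },
    /// Fall back to sending the row's cells individually.
    Cells { row: u16, data: Vec<[u8; 12]> },
}

/// Pick the cheaper encoding for one row of 12-byte cells.
fn encode_row(row: u16, cells: &[[u8; 12]]) -> RowOp {
    let first = cells[0];
    if cells.iter().all(|c| *c == first) {
        RowOp::Fill { row, cell: first }
    } else {
        RowOp::Cells { row, data: cells.to_vec() }
    }
}
```

A cleared row hits the `Fill` arm, which is where the 960-bytes-to-21 reduction comes from: one cell plus a small header instead of 80 cells.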

The fixed 12-byte cell size makes all of this possible, but some characters don’t fit in four bytes of UTF-8. Complex emoji and some CJK sequences store a hash of their content in the cell, and the actual string goes into a side table that’s only included when it’s non-empty. The edge cases stay out of the hot path.
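One way to sketch the side table; the hash function and table shape here are assumptions, not Blit's actual scheme:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Store a complex grapheme in the side table and return the hash
/// that goes into the fixed-size cell in its place.
fn intern_grapheme(table: &mut HashMap<u32, String>, s: &str) -> u32 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    let key = h.finish() as u32;
    table.insert(key, s.to_string());
    key
}
```

The cell stays 12 bytes no matter how long the grapheme cluster is, and the table rides along with the diff only when it has entries.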

Each client gets its own diff. The server remembers the last frame it delivered to each connection and diffs against that, not against some global previous state. A laptop on fast Wi-Fi and a phone on LTE see the same terminal, but the phone’s updates are relative to where it left off, which might be several frames behind. When a client falls behind and a frame has to be dropped, the server still advances that client’s baseline to the current state. The next diff stays small instead of snowballing.
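A toy version of the per-client bookkeeping (Blit diffs 12-byte cells; this sketch diffs single bytes and invents the names):

```rust
struct ClientState {
    /// The last frame this client applied. Diffs are computed against
    /// this, never against a shared global "previous" frame.
    baseline: Vec<u8>,
}

impl ClientState {
    /// Return the dirty positions for this client and advance its
    /// baseline, whether or not the frame is actually delivered,
    /// so the next diff stays small instead of snowballing.
    fn advance(&mut self, current: &[u8]) -> Vec<usize> {
        let dirty = current
            .iter()
            .zip(self.baseline.iter())
            .enumerate()
            .filter(|&(_, (a, b))| a != b)
            .map(|(i, _)| i)
            .collect();
        self.baseline = current.to_vec();
        dirty
    }
}
```

Two clients at different points in the stream produce different diffs from the same terminal state, which is the whole point: the phone that missed a frame just gets a slightly larger delta next time.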

Reduce, Reuse, Don’t Re-Render

The browser has to keep up with the server’s frame rate or the whole pipeline stalls. Blit renders through a WASM module backed by WebGL, and the two things that matter most are avoiding copies and avoiding redundant work.

The WASM module owns the terminal’s cell grid. When a compressed diff arrives, it decompresses and applies the ops directly to its internal buffer. To prepare a frame for display, it walks the grid once, resolves colors, looks up glyphs, and writes two flat arrays of vertex data: one for background rectangles and one for textured quads that carry the text. These arrays live in WASM linear memory, and the JavaScript renderer reads them as typed array views over the same backing buffer. No copy needed. The only copy in the whole path is the upload to the GPU.
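The single walk can be sketched like this, with an invented cell type and a toy three-float vertex layout (the real vertex format carries positions, texture coordinates, and colors):

```rust
struct DrawCell {
    ch: char,
    fg: u32,
    bg: u32,
}

/// One pass over the grid, emitting two flat arrays: background
/// rectangles for every cell, glyph quads only for visible characters.
fn build_vertices(grid: &[DrawCell], cols: usize) -> (Vec<f32>, Vec<f32>) {
    let mut bg = Vec::new();
    let mut glyphs = Vec::new();
    for (i, cell) in grid.iter().enumerate() {
        let (x, y) = ((i % cols) as f32, (i / cols) as f32);
        // every cell contributes a background rectangle
        bg.extend_from_slice(&[x, y, cell.bg as f32]);
        // only visible characters get a textured quad
        if cell.ch != ' ' {
            glyphs.extend_from_slice(&[x, y, cell.fg as f32]);
        }
    }
    (bg, glyphs)
}
```

Because both vectors live in WASM linear memory, the JavaScript side can wrap their pointer and length in a typed array view rather than copying them out.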

Painting more than once per animation frame is wasted work the user will never see. A program like htop might push several updates between two browser frames. Rather than painting each one, the renderer marks the grid dirty on the first update and collapses everything into a single repaint at the next frame.
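The coalescing amounts to a dirty flag; a minimal sketch:

```rust
struct Coalescer {
    dirty: bool,
    paints: u32,
}

impl Coalescer {
    fn new() -> Self {
        Self { dirty: false, paints: 0 }
    }
    /// Any number of server updates between frames just set the flag.
    fn on_update(&mut self) {
        self.dirty = true;
    }
    /// Called once per animation frame; repaints at most once,
    /// no matter how many updates landed since the last frame.
    fn on_frame(&mut self) {
        if self.dirty {
            self.paints += 1;
            self.dirty = false;
        }
    }
}
```

In the browser, `on_frame` would be driven by requestAnimationFrame, so the paint rate can never exceed the display rate regardless of how fast htop pushes updates.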

Even a full-screen repaint stays fast. The grid is a flat byte array, so the walk is sequential and cache-friendly. Spaces and wide-character continuations skip glyph generation entirely, which on a typical terminal is half the cells. Adjacent cells with the same background color merge into a single rectangle. Glyphs are cached in an atlas, so after the first frame nearly every lookup is a hit. All backgrounds draw in one call, all glyphs in another. A full repaint of a 200x50 terminal is two draw calls.
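The background merge is a run-length pass over each row's colors. A sketch, emitting (start column, width, color) runs:

```rust
/// Collapse adjacent cells with the same background color into
/// single rectangles: (start column, width, color).
fn merge_backgrounds(colors: &[u32]) -> Vec<(usize, usize, u32)> {
    let mut runs = Vec::new();
    let mut start = 0;
    for i in 1..=colors.len() {
        if i == colors.len() || colors[i] != colors[start] {
            runs.push((start, i - start, colors[start]));
            start = i;
        }
    }
    runs
}
```

A row with a uniform background becomes one rectangle instead of 80 or 200, which is a big part of why all backgrounds fit in a single draw call.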

Full repaints only happen on resize, palette changes, and programs that redraw everything. Scrolling output from cat or git log is cheaper. The server already detected the scroll and sent a copy-and-patch, so only the newly revealed rows are dirty. The renderer skips the unchanged cells and recomputes vertices for just the patch.

Congestion control

The naive approach is to send frames as fast as the server can produce them. On a fast network this works fine. On a slow one, frames pile up in buffers faster than the client can drain them, and latency spikes. Blit runs a per-client congestion controller inspired by BBR.

The first rule is to pace to the display, not the pipe. The client reports its refresh rate, and there’s no point pushing frames faster than the screen can show them. Bandwidth spent on frames the monitor will never display is bandwidth that could go to the next visible update.

Accurate RTT estimates let the controller push frames right up to the limit of what the client can display without queueing. The browser ACKs a frame only after it has rendered to the screen, not when it arrives, so the server’s latency estimate includes decompression, diff application, and GPU paint.

To handle even the spikiest mobile connection, bandwidth estimates use goodput over sliding windows rather than instantaneous rates. A single fast burst doesn’t mean the link is fast. If the server ramps up on a spike, it queues frames when throughput drops back down. Sliding windows smooth that out and let the server budget conservatively when the connection is unstable.
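A windowed goodput estimator can be as simple as this sketch (the window size and sample shape are assumptions):

```rust
use std::collections::VecDeque;

/// Goodput over a sliding window of (bytes acked, timestamp) samples.
/// A single fast burst barely moves the estimate; the window has to fill.
struct GoodputWindow {
    window_ms: u64,
    samples: VecDeque<(u64, u64)>, // (bytes, ack time in ms)
}

impl GoodputWindow {
    fn new(window_ms: u64) -> Self {
        Self { window_ms, samples: VecDeque::new() }
    }

    fn on_ack(&mut self, bytes: u64, now_ms: u64) {
        self.samples.push_back((bytes, now_ms));
        // evict samples that have aged out of the window
        while let Some(&(_, t)) = self.samples.front() {
            if now_ms - t > self.window_ms {
                self.samples.pop_front();
            } else {
                break;
            }
        }
    }

    /// Bytes per second averaged over the window, not the
    /// instantaneous rate of the most recent ACK.
    fn estimate_bps(&self) -> f64 {
        let total: u64 = self.samples.iter().map(|&(b, _)| b).sum();
        total as f64 * 1000.0 / self.window_ms as f64
    }
}
```

Because the denominator is the full window, a one-off burst raises the estimate only slightly, and the controller ramps up only when sustained throughput justifies it.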

In a TCP-based delivery path, packets don’t get lost; they get queued. The traditional signal to back off, packet loss, never fires. Instead the controller watches queue delay: the frame window grows when delay is low and stable, and backs off when delay rises. Queue depth is what the user feels as latency.
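A minimal sketch of the delay-gated window; the thresholds and backoff factor here are invented, not Blit's tuning:

```rust
/// Delay-based window control: grow additively while queue delay
/// stays low, back off multiplicatively when it rises.
struct DelayController {
    window: f64, // frames allowed in flight
}

impl DelayController {
    fn on_rtt_sample(&mut self, queue_delay_ms: f64) {
        if queue_delay_ms < 5.0 {
            // queues look empty: probe for more bandwidth
            self.window += 1.0;
        } else if queue_delay_ms > 25.0 {
            // delay is building: drain the queue before it becomes latency
            self.window = (self.window * 0.5).max(1.0);
        }
        // in between: hold steady
    }
}
```

The key property is that the controller reacts to the thing the user actually feels, queueing delay, rather than to loss, which a reliable transport hides.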

The active terminal gets priority. Background terminals share whatever bandwidth is left over at a lower frame rate, so they never compete with the one the user is typing in.
