@pete

How I built Ensemble

This was the first time I’ve ever built a text editor from scratch and this post attempts to outline my learnings. The final product, ensemblewriter.com, launched in March. Before getting into that, some background. I’ve been writing a lot of screenplays for film recently. Specifically, for short films (for now). Over the past year, I’ve attempted to fall in love with the following software:

  • Celtx and Final Draft: I paid for each of these (not at the same time) and had pretty high hopes that my journey of creative procrastination could end but was let down. Too much bloat. Too slow. Too expensive. They clearly are winning due to just being around the longest.

  • Highland Pro: This editor has the very best UI and UX in my experience. However, it was non-cloud based so I was constantly relying on Google Cloud for file syncing.

  • WriterDuet: This one has multi-user support and is cloud based, but the UX was lacking and I had quite a few performance issues, especially while collaborating.

  • Fountain in vim: Gets too unwieldy at any sort of length.

So about a year ago I set out to build something that I would enjoy spending hours of time writing in.

The Design

I tried several variations, mostly leaning heavily on native DOM elements, but ultimately landed on the following architecture:

flowchart TB
    subgraph ClientA["Browser — User A"]
        direction TB
        EA["Lightweight JS Editor<br/>input · UI · event loop"]
        WA["Rust / WASM Module<br/>heavy compute:<br/>layout · diff · geometry"]
        CA["HTML5 Canvas<br/>rendered output"]
        EA -->|"edit ops + state"| WA
        WA -->|"draw commands /<br/>pixel buffer"| CA
        EA -.->|"UI overlays<br/>(cursors, selection)"| CA
    end
 
    subgraph ClientB["Browser — User B"]
        direction TB
        EB["JS Editor"]
        WB["Rust / WASM"]
        CB["HTML5 Canvas"]
        EB -->|"edit ops"| WB
        WB -->|"draw commands"| CB
    end
 
    subgraph Server["Go Server"]
        direction TB
        WS["WebSocket Hub<br/>per-client connections"]
        REC["Reconciliation Engine<br/>OT merge<br/>conflict resolution"]
        DOC[("Authoritative<br/>Document State")]
        WS <--> REC
        REC <--> DOC
    end
 
    EA <-->|"WebSocket<br/>local ops ↕ remote ops"| WS
    EB <-->|"WebSocket<br/>local ops ↕ remote ops"| WS
 
    classDef client fill:#e8f0fe,stroke:#4285f4,color:#1a1a1a;
    classDef wasm fill:#fde7e7,stroke:#dea584,color:#1a1a1a;
    classDef canvas fill:#e6f4ea,stroke:#34a853,color:#1a1a1a;
    classDef server fill:#fef7e0,stroke:#fbbc04,color:#1a1a1a;
 
    class EA,EB client;
    class WA,WB wasm;
    class CA,CB canvas;
    class WS,REC,DOC server;
 
  • Editor: The editor stays thin, just handling input, UI state, and the event loop. It hands raw edit operations to the Rust/WASM module, which does the expensive work (layout, diffing, geometry).

  • WASM: WASM emits draw commands or writes directly to a pixel buffer that the HTML5 canvas renders. The dotted line shows the editor drawing lightweight UI overlays (remote cursors, selections) directly without round-tripping through WASM.

  • Editor <> Go server (WebSocket): Each client streams its local ops up and receives remote ops down. The Go server's hub fans these out, the reconciliation engine merges them (OT), and an authoritative document state keeps everyone convergent.

I decided to implement an AST to represent the screenplay document, which especially helps when importing and exporting to different formats.

The screenplay document is a flat sequence of typed blocks rather than a deep tree. A screenplay has very little real nesting. The only thing that wants structure is a dialogue block (character cue -> optional parenthetical -> dialogue lines, sometimes with extensions like (V.O.) or (CONT'D)). Everything else is a sibling: scene heading, action, transition, shot.

pub struct Block {
    pub id: BlockId,
    pub kind: BlockKind,
    pub text: String,
    pub children: Option<Vec<DialoguePart>>,
    pub character_id: Option<String>,
    pub byte_offset: usize,
    pub is_override: bool,
}

pub enum BlockKind {
    SceneHeading, Action, Transition,
    DialogueBlock, Shot, DualDialogue,
}

Some things worth calling out:

  • Byte offsets, not character offsets. Every cursor position, selection anchor, and block start is a usize byte offset into a flat String. This means O(1) seeks, but every mutation entry point has to snap to a UTF-8 char boundary. Luckily, screenplays rarely contain multi-byte (non-ASCII) characters, so most offsets already sit on a boundary.

  • Format overrides are sparse, keyed by byte offset. Most blocks are auto-detected by the parser, but a user can force "this paragraph is a Character cue, not Action." Instead of materializing that into a separate field on every block, overrides live in a BTreeMap<usize, ElementKind> on the editor and are folded in at parse time. Round-tripping through export/import is a simple extract_overrides() pass.

  • The Document is derived state. The source of truth inside Editor is the raw text: String plus the override map. export_document() re-parses on demand. This keeps the edit path stupid simple: a keystroke mutates a single string and bumps a version counter.
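
A minimal sketch of the byte-offset edit path described above, not the actual Ensemble code. The `Editor` fields, `snap`, and `insert` names are assumptions made for illustration, and the override map is left out to keep the example short. It shows the two things every mutation has to do: snap the offset to a char boundary, then mutate the string and bump the version.

```rust
/// Hypothetical, stripped-down editor state: raw text plus a version counter.
pub struct Editor {
    pub text: String,
    pub version: u64,
}

impl Editor {
    pub fn new(text: &str) -> Self {
        Editor { text: text.to_string(), version: 0 }
    }

    /// Clamp `offset` to the text length, then walk back to the nearest
    /// UTF-8 char boundary. Offset 0 is always a boundary, so this terminates.
    pub fn snap(&self, offset: usize) -> usize {
        let mut o = offset.min(self.text.len());
        while !self.text.is_char_boundary(o) {
            o -= 1;
        }
        o
    }

    /// Insert at a byte offset: snap, mutate the single string, bump the version.
    /// Returns the offset that was actually used.
    pub fn insert(&mut self, offset: usize, s: &str) -> usize {
        let at = self.snap(offset);
        self.text.insert_str(at, s);
        self.version += 1;
        at
    }
}
```

With the text "café!", the "é" occupies bytes 3 and 4, so an insert requested at offset 4 is snapped back to offset 3 instead of panicking inside `insert_str`.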

Element detection

The parser is heuristic, not grammar-based, because Fountain (the de-facto screenplay convention) is itself heuristic. The rules:

  • Scene heading: line starts with INT., EXT., INT./EXT., or I/E. (case-insensitive).

  • Transition: all-uppercase line ending in :.

  • Character cue: all-uppercase, no trailing . or ,, and not one of the action directives (INSERT, MONTAGE, FLASHBACK, etc.). Character extensions like (V.O.) and (CONT'D) are stripped before the uppercase check so BOB (V.O.) still classifies as a cue.

  • Parenthetical: starts with ( and ends with ).

  • Dialogue: anything that follows a cue/parenthetical inside a dialogue block.

  • Action: fallback, basically.

Detection runs over the flat text in one linear pass, producing a Vec<Element> with byte ranges. That vector is the input to layout.
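
The rules above can be sketched as a single-line classifier. This is an illustration under assumptions, not the real parser: the `Kind` enum, the `classify` signature, the `in_dialogue` flag (which the caller would set after a cue and clear on a blank line), the three-entry directive list, and the order in which the rules are checked are all hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind { SceneHeading, Transition, Character, Parenthetical, Dialogue, Action }

/// Uppercase directive words that are not character cues (illustrative list only).
const DIRECTIVES: [&str; 3] = ["INSERT", "MONTAGE", "FLASHBACK"];

/// True when the text has at least one letter and no lowercase letters.
fn is_all_upper(s: &str) -> bool {
    s.chars().any(|c| c.is_alphabetic()) && !s.chars().any(|c| c.is_lowercase())
}

/// Drop a trailing "(...)" extension such as "(V.O.)" or "(CONT'D)".
fn strip_extension(s: &str) -> &str {
    match s.rfind('(') {
        Some(i) if s.ends_with(')') => s[..i].trim_end(),
        _ => s,
    }
}

/// Classify one line. `in_dialogue` means the previous line was a cue,
/// a parenthetical, or a dialogue line.
pub fn classify(line: &str, in_dialogue: bool) -> Kind {
    let t = line.trim();
    let upper = t.to_uppercase();
    let scene_prefixes = ["INT.", "EXT.", "INT./EXT.", "I/E."];
    if scene_prefixes.iter().any(|p| upper.starts_with(*p)) {
        return Kind::SceneHeading;
    }
    if is_all_upper(t) && t.ends_with(':') {
        return Kind::Transition;
    }
    if t.starts_with('(') && t.ends_with(')') {
        return Kind::Parenthetical;
    }
    if in_dialogue {
        return Kind::Dialogue;
    }
    let name = strip_extension(t);
    let is_directive = DIRECTIVES.iter().any(|d| *d == name);
    if is_all_upper(name) && !name.ends_with('.') && !name.ends_with(',') && !is_directive {
        return Kind::Character;
    }
    Kind::Action // fallback
}
```

Checking `in_dialogue` before the cue rule is a design choice of this sketch: it keeps a shouted all-caps dialogue line from being mistaken for a new character cue.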

[Demo: typing each element type from an empty buffer and watching the AST panel classify them in real time.]

Rendering

The Rust core never touches the canvas (or any other DOM element). Instead, Editor::render() produces a RenderFrame:

pub enum RenderCommand {
    Clear { color },
    FillRect { rect, color },
    DrawText { x, y, text, font_size, color, bold },
    DrawCursor { x, y, height, color },
    DrawSelection { rect, color },
}

pub struct RenderFrame {
    pub commands: Vec<RenderCommand>,
    pub current_element: Option<String>,
    pub page_count: u32,
    pub current_page: u32,
    pub at_page_limit: bool,
    // ...
}

RenderFrame is Serialize on the Rust side and gets handed to JS as a plain JS object via serde_wasm_bindgen. The JS-side drawFrame() walks the commands and translates each into one or two 2D canvas calls. Off-screen commands (more than 20px outside the viewport) are dropped before they hit the canvas. In effect, this is simple viewport culling.
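
The off-screen filter described above lives in JS in Ensemble. The sketch below expresses the same idea in Rust only to stay consistent with the other examples. `Cmd` is a simplified, hypothetical stand-in for a render command that keeps just the vertical extent, and `cull` is an assumed name.

```rust
/// Simplified stand-in for a render command: only the vertical extent matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Cmd {
    pub y: f32,
    pub height: f32,
    pub label: &'static str,
}

/// Tolerance around the viewport, matching the 20px figure from the post.
const CULL_MARGIN: f32 = 20.0;

/// Keep only the commands whose vertical span touches the viewport plus margin.
pub fn cull(cmds: &[Cmd], scroll_top: f32, viewport_height: f32) -> Vec<Cmd> {
    let top = scroll_top - CULL_MARGIN;
    let bottom = scroll_top + viewport_height + CULL_MARGIN;
    cmds.iter()
        .filter(|c| c.y + c.height >= top && c.y <= bottom)
        .cloned()
        .collect()
}
```

With a viewport that starts at y = 480 and is 600px tall, a command at y = 450 with height 12 survives because its bottom edge falls inside the 20px margin, while commands at y = 0 and y = 2000 are dropped.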

This design has two important properties:

  1. No shared state across the boundary. WASM doesn't hold a canvas handle; JS doesn't hold a document handle. The only thing crossing is a serialized event in and a serialized frame out. This was mostly intentional to make testing simple and fast.

  2. The JS side can layer its own overlays. Remote cursors, selections, search highlights, and spell-check underlines are all drawn in JS on top of the frame WASM produced. To position a remote cursor I just ask WASM to translate that user's byte offset to a pixel coordinate (offset_to_pixel), then fillRect it from JS. WASM never has to know about other users.

The WASM Bridge

WasmEditor is the only thing exposed in the binary:

#[wasm_bindgen]
impl WasmEditor {
    pub fn handle_event(&mut self, event: JsValue) -> Result<JsValue, JsValue> {
        let input: InputEvent = serde_wasm_bindgen::from_value(event)?;
        let frame = self.inner.process_event(input);
        Ok(serde_wasm_bindgen::to_value(&frame)?)
    }
    pub fn render(&self) -> Result<JsValue, JsValue> { /* ... */ }
    pub fn take_ops(&mut self) -> Result<JsValue, JsValue> { /* ... */ }
    pub fn apply_remote_ops(&mut self, ops: JsValue) -> Result<(), JsValue> { /* ... */ }
    pub fn export_document(&self) -> Result<JsValue, JsValue> { /* ... */ }
    pub fn import_document(&mut self, doc: JsValue) -> Result<(), JsValue> { /* ... */ }
}

JS events are typed enums on the Rust side:

  • KeyDown { key, ctrl, shift, alt, meta }

  • MouseDown { x, y, button, click_count }

  • Scroll { delta_x, delta_y }

  • Paste { text }

The wire format is whatever serde_wasm_bindgen does with #[derive(Serialize, Deserialize)] - plain JS objects with snake_case keys it seems.

The event loop on the JS side has no requestAnimationFrame polling. Each input event synchronously calls handle_event, gets a RenderFrame back, and draws it. The only RAF in the system is for mobile inertial scroll momentum. Cursor blink is a 530ms setInterval that re-paints the cached last frame with the cursor toggled. It doesn't re-enter WASM.

Layout caching

I learned that the expensive thing isn't parsing. It's layout. For a 120-page screenplay, parsing is ~36µs but a full layout pass (wrap text to column width, apply spacing rules between scene headings and transitions, walk page boundaries and inject (MORE) / CHARACTER (CONT'D) continuation rows) is the slow part.
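
To make the first layout step concrete, here is a minimal greedy word-wrap sketch. It is not the Ensemble layout code: the `wrap` name is hypothetical, and it measures width in characters on the assumption of a monospaced screenplay font, whereas the real layout pass may work in pixels. Spacing rules and page-break continuation rows are out of scope.

```rust
/// Greedy word wrap to a fixed column width measured in characters.
/// A word longer than `width` is placed on its own line, unbroken.
pub fn wrap(text: &str, width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut cur = String::new();
    for word in text.split_whitespace() {
        let word_len = word.chars().count();
        let cur_len = cur.chars().count();
        // Start a new line when the word (plus one space) no longer fits.
        if !cur.is_empty() && cur_len + 1 + word_len > width {
            lines.push(std::mem::take(&mut cur));
        }
        if !cur.is_empty() {
            cur.push(' ');
        }
        cur.push_str(word);
    }
    if !cur.is_empty() {
        lines.push(cur);
    }
    lines
}
```

Even this simple pass touches every word in the document, which hints at why a full relayout costs far more than a cache hit.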

A keystroke has to relayout. A cursor move, a scroll, a mouse hover, a window focus… none of those change the layout. So the editor keeps a version counter that's bumped only on text/override/width changes:

fn build_visual_map_with_elems(&self) -> (VisualLineMap, Vec<Element>) {
    let v = self.layout_inputs_version;
    if let Some(cached) = self.layout_cache.borrow().as_ref() {
        if cached.version == v {
            return (cached.map.clone(), cached.elems.clone());
        }
    }
    // ... compute and cache ...
}

The numbers on a 120-page document (M5 Pro, rustc 1.95):

Operation                                          Time
render (cache hit)                                 37 µs
cursor move down + up                              277 µs
keystroke (insert + backspace, forces relayout)    19.9 ms

So a cursor move is ~70× cheaper than a keystroke (19.9 ms vs 277 µs), and a pure re-render is ~500× cheaper (19.9 ms vs 37 µs). That gap is what makes scrolling and cursor blink free.

[Demo: holding ArrowDown through a 120-page document. The sparkline drops to the floor because no event in the stream invalidates the layout cache.]

[Demo: sustained typing into the same 120-page document. Every keystroke forces a relayout, but each one still lands well under the 16ms frame budget.]

[Demo: cold load. Synthesize a 120+ page screenplay from concatenated samples and parse it in one shot.]

Real-time Collaboration

The Go server runs a hub-room-client topology. The Hub holds a map[string]*Room keyed by project ID; each Room owns the authoritative text, a monotonic revision: uint64, a capped operation history (1000 entries), and the set of connected clients.

Connection lifecycle:

  1. The client authenticates via session cookie, and the handler attaches it to a Room, creating the room if needed (a new room lazily loads its text and revision from MongoDB).

  2. The client receives a Snapshot message: full text, current revision, list of other connected peers.

  3. From then on, it streams Ops { base_revision, ops, client_seq } up and receives RemoteOps and Ack messages down.

Operations are just two variants:

type Op struct {
    Type   string // "insert" | "delete"
    Offset int
    Text   string // for insert
    Length int    // for delete
}

The Transform(a, b Op) Op function handles all four pair combinations (ins/ins, ins/del, del/ins, del/del). The interesting ones:

  • Insert vs Insert at the same offset: server wins (the one already in history). The incoming op shifts right by the existing insert's length.

  • Insert vs Delete: if the insert falls inside the deletion range, it collapses to the deletion's start offset.

  • Delete vs Delete with overlap: compute the overlap region, shrink the incoming delete's length by the overlap, and adjust its offset by the prefix that was already consumed.
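
The pair rules above can be sketched as follows. This is written in Rust for consistency with the other examples, while the post's implementations are Go and JS, and it differs from the described `Transform(a, b Op) Op` in two labeled ways: it returns a `Vec<Op>` so that a delete split by a concurrent insert can become two deletes, and it takes an `applied_wins_ties` flag to make the same-offset tie-break explicit. Both choices, along with the `apply` helper, are assumptions of this sketch.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    Insert { offset: usize, text: String },
    Delete { offset: usize, len: usize },
}

/// Rewrite `op` so it can be applied after `applied` has already happened.
/// `applied_wins_ties`: when two inserts share an offset, `applied` stays left.
pub fn transform(op: &Op, applied: &Op, applied_wins_ties: bool) -> Vec<Op> {
    use Op::*;
    match (op, applied) {
        (Insert { offset: p, text }, Insert { offset: q, text: t }) => {
            let shift = *q < *p || (*q == *p && applied_wins_ties);
            let offset = if shift { p + t.len() } else { *p };
            vec![Insert { offset, text: text.clone() }]
        }
        (Insert { offset: p, text }, Delete { offset: q, len: m }) => {
            // Before the deletion: unchanged. After it: shift left.
            // Inside it: collapse to the deletion's start.
            let offset = if *p <= *q { *p } else if *p >= q + m { p - m } else { *q };
            vec![Insert { offset, text: text.clone() }]
        }
        (Delete { offset: p, len: n }, Insert { offset: q, text: t }) => {
            if *q <= *p {
                vec![Delete { offset: p + t.len(), len: *n }]
            } else if *q >= p + n {
                vec![op.clone()]
            } else {
                // The insert landed inside the deleted range:
                // delete the tail first, then the head, so offsets stay valid.
                vec![
                    Delete { offset: q + t.len(), len: p + n - q },
                    Delete { offset: *p, len: q - p },
                ]
            }
        }
        (Delete { offset: p, len: n }, Delete { offset: q, len: m }) => {
            let (p, n, q, m) = (*p, *n, *q, *m);
            let overlap = (p + n).min(q + m).saturating_sub(p.max(q));
            // Bytes the applied delete already removed in front of our start.
            let consumed_before = if p > q { p.min(q + m) - q } else { 0 };
            let len = n - overlap;
            if len == 0 { vec![] } else { vec![Delete { offset: p - consumed_before, len }] }
        }
    }
}

/// Apply one op to a document, returning the new text.
pub fn apply(doc: &str, op: &Op) -> String {
    let mut out = doc.to_string();
    match op {
        Op::Insert { offset, text } => out.insert_str(*offset, text),
        Op::Delete { offset, len } => out.replace_range(*offset..*offset + *len, ""),
    }
    out
}
```

For example, starting from "abcdef" with a delete of "bcd" and a concurrent insert of "XY" at offset 2, applying either op first and then the transformed other op yields "aXYef".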

When ops arrive from a client with BaseRevision = N, the server transforms them iteratively against every history entry from N+1 to current, applies them to room.text, bumps the revision, appends to history, then broadcasts the transformed result to all other clients in the room.
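
A minimal sketch of that server-side rebase loop, written in Rust for consistency with the other examples even though the real server is Go. To stay short it models inserts only, so each transform is a single offset shift. The `Room::receive` name, the `Insert` struct, and the choice to return `None` when the base revision has fallen out of the capped history or the offset is invalid are all assumptions; the post does not say how those cases are handled.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Insert {
    pub offset: usize,
    pub text: String,
}

/// History cap from the post.
const HISTORY_CAP: usize = 1000;

pub struct Room {
    pub text: String,
    pub revision: u64,
    /// (revision produced, op) pairs, oldest first.
    pub history: Vec<(u64, Insert)>,
}

impl Room {
    /// Rebase `op` (authored against `base_revision`) onto the current text,
    /// apply it, and return the transformed op that would be broadcast.
    pub fn receive(&mut self, base_revision: u64, mut op: Insert) -> Option<Insert> {
        // Every revision after the base must still be present in the history.
        let missing = base_revision < self.revision
            && self.history.first().map_or(true, |(r, _)| *r > base_revision + 1);
        if missing || base_revision > self.revision {
            return None;
        }
        // Transform against each history entry newer than the base.
        for (rev, done) in &self.history {
            if *rev > base_revision && done.offset <= op.offset {
                op.offset += done.text.len(); // server wins ties: shift right
            }
        }
        if !self.text.is_char_boundary(op.offset) {
            return None; // out of range or mid-character
        }
        self.text.insert_str(op.offset, &op.text);
        self.revision += 1;
        self.history.push((self.revision, op.clone()));
        if self.history.len() > HISTORY_CAP {
            self.history.remove(0);
        }
        Some(op)
    }
}
```

If two clients both insert at offset 1 against revision 0 of "ab", the first lands unchanged and the second is shifted to offset 2, giving "aXYb" at revision 2.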

On the client side, collab.js runs a three-state machine:

  • SYNCHRONIZED: nothing in flight, send immediately

  • AWAITING_ACK: sent ops, server hasn't acked; new local ops get buffered

  • AWAITING_ACK_WITH_BUFFER: when the ack comes, transform the buffer against the just-acked revision and send
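
The three states can be sketched as a small state machine. The real one lives in collab.js; this Rust version is only an illustration, with hypothetical `Client`, `on_local_op`, and `on_ack` names, ops reduced to strings, and a `sent` log standing in for the WebSocket. The transform step on ack and on remote ops is deliberately left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Synchronized,
    AwaitingAck,
    AwaitingAckWithBuffer,
}

pub struct Client {
    pub state: State,
    /// Local ops waiting for the next send window.
    pub buffer: Vec<String>,
    /// Batches handed to the socket, kept here so the behavior can be inspected.
    pub sent: Vec<Vec<String>>,
}

impl Client {
    pub fn new() -> Self {
        Client { state: State::Synchronized, buffer: Vec::new(), sent: Vec::new() }
    }

    /// A local edit: send right away if nothing is in flight, otherwise buffer it.
    pub fn on_local_op(&mut self, op: &str) {
        match self.state {
            State::Synchronized => {
                self.sent.push(vec![op.to_string()]);
                self.state = State::AwaitingAck;
            }
            State::AwaitingAck | State::AwaitingAckWithBuffer => {
                self.buffer.push(op.to_string());
                self.state = State::AwaitingAckWithBuffer;
            }
        }
    }

    /// The server acknowledged the in-flight batch: flush the buffer if there is one.
    pub fn on_ack(&mut self) {
        match self.state {
            State::AwaitingAckWithBuffer => {
                let batch = std::mem::take(&mut self.buffer);
                self.sent.push(batch);
                self.state = State::AwaitingAck;
            }
            _ => self.state = State::Synchronized,
        }
    }
}
```

The point of the machine is that at most one batch is ever in flight, so the client always knows exactly which of its ops the server has not yet seen.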

When a RemoteOps message lands while we're still waiting on an ack, three things happen in order:

  1. Transform our inflight ops against the remote ops.

  2. Transform any buffered local ops against the remote ops too.

  3. Apply the remote ops to the local WASM editor via apply_remote_ops().

That's the bidirectional transformPair loop. Once it finishes, the local state matches the server's, and any pending local work has been rebased onto the new baseline so it'll apply cleanly the next time we get a send window.

Convergence falls out of three properties stacked together:

  • Both client and server run the same Transform function.

  • The server's monotonic revision numbers give every operation a total order.

  • Transform(a, b) satisfies the standard OT convergence property (often called TP1): running a then b' lands in the same place as running b then a'.

Stack those and every client ends up at the same document, regardless of the order their packets arrived in.

Heartbeats and reconnect

WebSocket pings every 54 seconds. A missed pong for 60 seconds tears the connection down. The client reconnects with exponential backoff capped at 30 seconds, and the server sends a fresh Snapshot to bring it back up to date. No replay-from-revision protocol, because re-sending the full document is cheaper than keeping a per-client cursor into the history buffer.
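
The reconnect delay can be sketched as a capped doubling schedule. The real client is JS; this Rust version is illustrative only. The 30-second cap comes from the post, while the 1-second base, the `backoff` name, and the absence of jitter are assumptions.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based): doubles each time, capped at 30s.
/// The 1-second base is an assumption; the post only states the cap.
pub fn backoff(attempt: u32) -> Duration {
    const BASE_MS: u64 = 1_000;
    const CAP_MS: u64 = 30_000;
    // checked_shl returns None once the shift would exceed the integer width.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Duration::from_millis(BASE_MS.saturating_mul(factor).min(CAP_MS))
}
```

Under these assumptions the delays run 1s, 2s, 4s, 8s, 16s, and then stay at 30s for every later attempt.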

Presence is out of band

Remote cursor positions don't go through OT. They're sent as separate Presence { byte_offset, anchor } messages and broadcast as RemotePresence. The server doesn't persist them or care about ordering. The client receives them, stores them in a Map<clientId, { offset, anchor, color, name }>, and on every local render asks WASM to translate the byte offset to pixel coords before drawing the ghost cursor in JS. Decoupling presence from the document op stream means a chatty cursor doesn't churn the OT path, and presence loss is harmless because it'll re-arrive on the next mouse move.

Closing

There's a Stripe-backed Pro tier sitting alongside the editor which includes AI-generated shot lists and a handful of other filmmaking-oriented features, but it's a separate concern from the client app I've been writing about, so I've kept it out of scope here.

A year in, the honest takeaway: building a custom text editor in Rust/WASM is one of the more rewarding things I've worked on, but it's hard to justify unless your document has structure that off-the-shelf editors actively fight against.

Screenplays do. Fixed column widths, heuristic element detection, page-break fixup with (MORE) / (CONT'D). If your product lives comfortably inside a contenteditable div or a Monaco instance, build it there. Reach for this kind of architecture only when the format is the product.