How It Works

The mechanics: dual audio capture, real-time speech-to-text, latency, and how answer suggestions are generated.

This cluster is for people who want to understand the pipeline before they trust their interview to it. Reasonable.

End-to-end, an answer suggestion takes four steps: capture, transcribe, generate, render. Capture is OS-native — ScreenCaptureKit (macOS) or WASAPI (Windows) — pulling system audio at the OS level so the AI hears the interviewer the way your speakers do. The microphone is captured separately so the AI also has your audio for context and for the post-interview transcript. Transcription is real-time speech-to-text. Generation passes the question plus your resume, the job description, and the conversation history so far to GPT-4o, with a system prompt that constrains output to interview-appropriate length. Rendering streams the answer into a floating overlay window that exists outside the conferencing app's window — you can drag it anywhere, including off the screen-share area.

The end-to-end first-token latency budget is sub-400 milliseconds. Past that point your eyes shift off-camera while you read the answer, which defeats the purpose. The answers below cover each stage in detail, what happens when the budget is exceeded, and the trade-offs we picked. (For the deeper why-we-built-it context, see the founder letter.)

← All answer topics