How It Works
When you create a Speech Pipe on Unpod, it is automatically available as a WebSocket endpoint. Any client - browser, mobile app, desktop, IoT device - can connect to that endpoint and have a real-time voice conversation. Your agent code does not change. Speech processing (STT, VAD, barge-in, endpointing, TTS) all happens on Unpod’s side. Your AgentRunner just receives text and returns text, exactly as it does for phone calls.
Connecting a Client
Unpod exposes a standard WebSocket URL per Speech Pipe. Your client connects with a short-lived token.
Generating a Session Token
Your backend generates a short-lived token for each user session. Pass it to the client - never expose your API key on the frontend.
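The token flow described above can be sketched roughly as follows. This is an illustration only: `issue_session_token`, `client_payload`, the `WS_BASE_URL` placeholder, and passing the token as a `token` query parameter are all assumptions made for the example, not Unpod's documented API. The stub returns a random value where a real backend would call Unpod using the API key.

```python
import secrets
import time
from dataclasses import dataclass
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the real endpoint for your Speech Pipe.
WS_BASE_URL = "wss://example.invalid/speech-pipe"


@dataclass(frozen=True)
class SessionToken:
    value: str
    expires_at: float  # Unix timestamp after which the token should be rejected


def issue_session_token(api_key: str, ttl_seconds: int = 60) -> SessionToken:
    """Stand-in for the server-side call that obtains a session token.

    A real implementation would authenticate to Unpod with api_key; this stub
    only returns a random value so the surrounding flow can be run locally.
    """
    if not api_key:
        raise ValueError("api_key must be set on the backend")
    return SessionToken(secrets.token_urlsafe(32), time.time() + ttl_seconds)


def client_payload(token: SessionToken) -> dict:
    """Build what the backend sends to the browser or app: no API key inside."""
    # Passing the token as a query parameter is an assumption of this sketch.
    query = urlencode({"token": token.value})
    return {"ws_url": f"{WS_BASE_URL}?{query}", "expires_at": token.expires_at}
```

The point of the split is that the API key is only ever an argument to the backend function, while the client receives a URL and an expiry time and nothing else.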
Same Agent, Multiple Entry Points
Your AgentRunner accepts sessions from all sources on the same agent_id. You do not need separate agents for phone calls and web/app clients.
CallContext tells you how the session arrived:
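As a minimal sketch of branching on the session's origin, the example below uses a stand-in `CallContext` dataclass. The `source` field and its `"phone"` / `"websocket"` values are assumptions for illustration; the real CallContext attributes may be named differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallContext:
    """Hypothetical stand-in; the real CallContext fields may differ."""
    session_id: str
    source: str  # assumed values: "phone" or "websocket"


def greeting_for(ctx: CallContext) -> str:
    """Choose an opening line based on how the session arrived."""
    if ctx.source == "phone":
        return "Thanks for calling. How can I help?"
    if ctx.source == "websocket":
        return "Hi! You're connected from the app. How can I help?"
    # Unknown sources fall back to a neutral greeting instead of failing.
    return "Hello. How can I help?"
```

Everything after the greeting can stay shared, which is what lets one agent serve both entry points.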
What Unpod Handles
Everything audio-related is managed by Unpod on both phone and WebSocket sessions:
| Capability | Phone | WebSocket |
|---|---|---|
| STT (transcription) | Yes | Yes |
| VAD (turn detection) | Yes | Yes |
| Barge-in detection | Yes | Yes |
| TTS (synthesis) | Yes | Yes |
| Recording | Yes | Yes |
| Transcript storage | Yes | Yes |
Media binding is handled entirely on Unpod’s side. A media worker joins the
LiveKit room for the session and bridges the caller’s audio; your dialog brain
connects over a separate text-only channel and never touches the audio stream
or any SDP negotiation. This is why the same agent code runs unchanged across
phone and WebSocket sessions.
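The separation between the media side and the text-only agent can be illustrated with a toy model. Nothing here reflects Unpod's or LiveKit's actual interfaces: `fake_stt`, `fake_tts`, and `media_worker_turn` are hypothetical stand-ins, and the "audio" is just UTF-8 bytes so the example runs without any speech stack.

```python
from typing import Callable

# The only thing the agent implements: text in, text out.
AgentFn = Callable[[str], str]


def echo_agent(user_text: str) -> str:
    """A trivial agent that never sees audio, only the transcript."""
    return f"You said: {user_text}"


def fake_stt(audio: bytes) -> str:
    """Stand-in for transcription: here the 'audio' is just UTF-8 bytes."""
    return audio.decode("utf-8")


def fake_tts(text: str) -> bytes:
    """Stand-in for synthesis: returns the reply encoded as bytes."""
    return text.encode("utf-8")


def media_worker_turn(audio_in: bytes, agent: AgentFn) -> bytes:
    """One turn as seen from the media side; the agent never receives bytes."""
    transcript = fake_stt(audio_in)
    reply_text = agent(transcript)
    return fake_tts(reply_text)
```

Because `echo_agent` only handles strings, it can be called the same way whether the bytes originated from a phone call or a WebSocket client, and it can be unit-tested with plain text.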
Next Steps
SDK Setup
AgentRunner reference - capacity, env vars, graceful shutdown.
Session Controls
say(), transfer, end, and hooks that work for all session types.