The /telephony/ws endpoint
The Freya voice agent (pipecat-agent) exposes ws://<host>:7860/telephony/ws. Asterisk's chan_websocket connects here for every call. Inside the agent, this socket is plugged directly into the Pipecat pipeline.
- Accepts a WebSocket from Asterisk's chan_websocket.
- Receives JSON-framed audio chunks from the caller side.
- Pushes JSON-framed TTS audio back to Asterisk.
chan_websocket — what it really is
It is a Sangoma-maintained Asterisk module that makes a WebSocket destination look like any other channel to the dialplan. From Dial()'s perspective it's the same as PJSIP, IAX2, or Local — different transport, same channel surface.
- One channel = one WebSocket connection.
- Can be the destination of Dial(WebSocket/...).
- Can be Originate-d for outbound flows.
- Carries audio frames as binary or wrapped in JSON envelopes (configurable).
In our config we use JSON frames (f(json) in the Dial string), which makes parsing trivial on the agent side and lets us multiplex control messages alongside media.
G.711 µ-law frame format
Audio leaves Asterisk in 20 ms chunks of µ-law (8 kHz, 1 byte per sample = 160 bytes per chunk, 50 chunks/s). The agent sees:
{"type": "media", "audio": "<base64-160-bytes>", "ts": 12345}
Sizes worth memorising:
- 20 ms × 8000 samples/s × 1 byte/sample = 160 bytes per frame.
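- 1 byte per sample because µ-law companding maps each linear PCM sample (14-bit range, usually carried as 16-bit) onto an 8-bit logarithmic code.
- Base64 expansion: 160 raw bytes → 216 base64 characters, before the JSON envelope overhead.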
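The sizes above can be checked with a few lines of Python. This is a minimal sketch, not the agent's actual code: the helper names are made up for illustration, and the envelope fields are taken from the sample message shown earlier.

```python
import base64
import json

SAMPLE_RATE_HZ = 8000     # G.711 narrowband sample rate
FRAME_MS = 20             # frame duration used in this document
BYTES_PER_SAMPLE = 1      # µ-law stores one 8-bit code per sample


def frame_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ, frame_ms: int = FRAME_MS) -> int:
    """Raw payload size of one audio frame, in bytes."""
    return sample_rate_hz * frame_ms // 1000 * BYTES_PER_SAMPLE


def base64_len(n_raw: int) -> int:
    """Length of padded base64 text for n_raw bytes (4 chars per 3-byte group)."""
    return 4 * ((n_raw + 2) // 3)


def make_media_envelope(ulaw: bytes, ts: int) -> str:
    """Wrap one µ-law frame in a media envelope (hypothetical helper)."""
    audio = base64.b64encode(ulaw).decode("ascii")
    return json.dumps({"type": "media", "audio": audio, "ts": ts})


def parse_media_envelope(text: str) -> bytes:
    """Return the raw µ-law bytes carried by a media envelope."""
    msg = json.loads(text)
    if msg.get("type") != "media":
        raise ValueError("not a media envelope")
    return base64.b64decode(msg["audio"])


if __name__ == "__main__":
    silence = b"\xff" * frame_bytes()          # 0xFF decodes to zero amplitude in µ-law
    wire = make_media_envelope(silence, ts=12345)
    print(frame_bytes(), base64_len(frame_bytes()), len(wire))
```

The last printed number is the full envelope length, which is what actually crosses the WebSocket for each 20 ms of audio.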
JSON control messages
Alongside media frames, the protocol multiplexes control envelopes. Knowing them by sight is half the debugging battle.
| type | direction | meaning |
|---|---|---|
| start | Asterisk → agent | Call begins. Carries call-id, codec, direction. |
| media | both | 20 ms µ-law audio chunk, base64-encoded. |
| dtmf | Asterisk → agent | DTMF digit. Rare; we mostly carry DTMF in-band. |
| clear | agent → Asterisk | Flush buffered TTS audio (interruption). |
| stop | both | Call ended. |
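On the agent side, handling these envelopes amounts to a dispatch on the type field. The sketch below is illustrative only: CallState and handle_message are hypothetical names, and the "call-id" and "digit" keys are assumptions, since the table lists what each message carries but not the exact field names.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CallState:
    """Minimal per-call state for the sketch (hypothetical)."""
    call_id: Optional[str] = None
    active: bool = False
    audio: List[bytes] = field(default_factory=list)
    digits: str = ""


def handle_message(state: CallState, text: str) -> None:
    """Apply one inbound (Asterisk -> agent) envelope to the call state."""
    msg = json.loads(text)
    kind = msg.get("type")
    if kind == "start":
        state.call_id = msg.get("call-id")        # assumed key name
        state.active = True
    elif kind == "media":
        state.audio.append(base64.b64decode(msg["audio"]))
    elif kind == "dtmf":
        state.digits += str(msg.get("digit", ""))  # assumed key name
    elif kind == "stop":
        state.active = False
    else:
        # "clear" only travels agent -> Asterisk, so it is unexpected here too.
        raise ValueError(f"unexpected message type: {kind!r}")
```

Raising on unknown types is a deliberate choice for debugging; a production handler might prefer to log and continue.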
Audio direction
Bidirectional, full-duplex on the same WebSocket.
- Asterisk → agent: caller's voice. Goes to STT.
- Agent → Asterisk: TTS output. Routed back to the caller via RTP.
Both legs share frame numbering (ts) so the agent can interrupt itself cleanly with a clear when the user starts talking over the bot.
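The interruption step can be sketched as a small outbound queue that is emptied when the caller starts talking. This is an assumption-heavy illustration: OutboundAudio is a hypothetical class, and the clear envelope is assumed to need only its type field.

```python
import json
from collections import deque
from typing import Deque, List


class OutboundAudio:
    """TTS envelopes waiting to be written to the WebSocket (hypothetical)."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()

    def enqueue(self, envelope: str) -> None:
        """Queue one serialized media envelope for sending."""
        self._pending.append(envelope)

    def drain(self) -> List[str]:
        """Return and remove everything currently queued."""
        out = list(self._pending)
        self._pending.clear()
        return out

    def barge_in(self) -> str:
        """Drop unsent TTS locally and return the clear envelope to send."""
        self._pending.clear()
        return json.dumps({"type": "clear"})  # envelope shape is an assumption
```

Clearing the local queue first matters: otherwise frames queued before the interruption would be sent after the clear and replayed to the caller.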
The c(ulaw)f(json) codec/format spec
Inside Dial(WebSocket/ai_media/c(ulaw)f(json)) the bracketed flags pin both the codec and the framing.
- c(ulaw): Asterisk transcodes any incoming codec to µ-law before sending over WS, and accepts µ-law back.
- f(json): frames are wrapped in JSON envelopes, not raw binary.
If the trunk side is alaw, Asterisk transcodes once on entry and once on exit. Cheap on modern CPUs but it counts in latency budgets.
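When auditing dialplans or logs it can help to pull the codec and framing out of a Dial target mechanically. The helper below is a hypothetical convenience for that purpose, not how Asterisk parses the string; it assumes the options sit in the last path segment as single-letter flag(value) pairs, as in the example above.

```python
import re
from typing import Dict

# One letter followed by a parenthesised value, e.g. c(ulaw)
_OPTION = re.compile(r"([A-Za-z])\(([^)]*)\)")


def parse_ws_options(dial_target: str) -> Dict[str, str]:
    """Extract flag(value) pairs from the last segment of a WebSocket dial target."""
    options = dial_target.rsplit("/", 1)[-1]
    return {flag: value for flag, value in _OPTION.findall(options)}


def needs_transcode(trunk_codec: str, ws_codec: str) -> bool:
    """True when the trunk codec differs from the codec pinned on the WS leg."""
    return trunk_codec.lower() != ws_codec.lower()


if __name__ == "__main__":
    opts = parse_ws_options("WebSocket/ai_media/c(ulaw)f(json)")
    print(opts, needs_transcode("alaw", opts["c"]))
```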
Call recording with MixMonitor
exten => _X.,n,MixMonitor(${UNIQUEID}-mixed.wav,b)
MixMonitor taps the audio bridge and writes a WAV containing both directions. Variants:
- MixMonitor(file.wav): mixed only.
- MixMonitor(file.wav,r(read.wav)t(write.wav)): separate caller/agent tracks.
Our deployments record the mix and upload via a hangup handler (upload-recording.sh). The dashboard exposes per-track downloads via ?track=user|assistant when separate tracks are available.
Inbound call flow end-to-end
One frame's journey, from PSTN handset to LLM and back. Each arrow is a place a packet can get lost — RTP, transcode, WebSocket, STT buffer, TTS pacing, RTP again.
PSTN/Customer SBC
| SIP INVITE (UDP/TCP, port 5060)
v
freya-asterisk // pjsip.conf identifies caller by IP, routes to from-trunk
| Dial(WebSocket/ai_media/c(ulaw)f(json))
v
pipecat-agent (port 7860 /telephony/ws)
| audio frames → Whisper STT
| text → LLM → TTS (Spark)
| audio frames back to Asterisk
v
freya-asterisk → RTP → SBC → PSTN
Outbound call flow
Outbound is initiated by the campaign-worker over ARI. Asterisk talks SIP to the trunk; once the carrier returns 200 OK, the channel goes Up and is bridged to a fresh WebSocket leg into pipecat-agent.
The KKB campaign we analysed lives in this exact path. The 603 came from trunk → PSTN (carrier-side rejection), so the SIP ladder fails before the ARI ladder ever sees a ChannelStateChange: Up.
Originate via ARI vs AMI
Two ways to ask Asterisk to place a call from outside.
| Interface | Transport | Style | When we use it |
|---|---|---|---|
| ARI | HTTP + WS | JSON, RESTful | Default. campaign-worker is HTTP-native; trivial to call from any service. |
| AMI | TCP / 5038 | Line-based actions | Legacy integrations. Same outcome, older protocol. |
Sample ARI Originate body:
POST /ari/channels
{
  "endpoint": "PJSIP/providers/sip:+905374705251@93.180.132.170",
  "extension": "s",
  "context": "from-trunk-outbound",
  "callerId": "908502427127",
  "variables": {
    "X-Freya-Direction": "outbound",
    "X-Freya-Call-Id": "..."
  }
}
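For reference, a request like this can be assembled with nothing but the Python standard library. The sketch below only builds the request object; the base URL, credentials, and the function name are placeholders, and it assumes ARI is reachable over plain HTTP with basic authentication. How campaign-worker actually issues the call is not shown in this document.

```python
import base64
import json
import urllib.request
from typing import Any, Dict


def build_originate_request(
    base_url: str, user: str, password: str, body: Dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) a POST /ari/channels request."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/ari/channels",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    # Placeholder values; substitute the real endpoint, context and credentials.
    req = build_originate_request(
        "http://localhost:8088", "ari-user", "ari-pass",
        {"endpoint": "PJSIP/providers/sip:+10000000000@192.0.2.10",
         "extension": "s", "context": "from-trunk-outbound"},
    )
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would send it.
```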
Where pipecat-agent runs
In our docker-compose, the agent is host-networked alongside Asterisk:
voice-agent:
  image: ...freya-onprem/voice-agent:latest
  network_mode: host   # uses host's network stack directly
  ports:
    - 7860             # advertised port
  ...
network_mode: host is required because the agent needs to talk RTP to coturn and the SBC without docker-bridge NAT in the way. Asterisk runs the same way for the same reason — anything that touches RTP must avoid double-NATting.
NC-Opt — the noise cancellation service
A separate WebSocket service at ws://nc-service:8005. The agent can stream raw mic audio to NC-Opt, get denoised audio back, then forward the cleaned-up frames to STT.
- Optional but improves accuracy on noisy lines (mobile callers, car cabins).
- GPU service, one GPU dedicated.
- Configurable RTF (real-time factor) and concurrency.
$ docker logs -f nc-opt 2>&1 | grep -E 'rtf|frames|err'
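The rtf values in those logs are easier to read with the definition at hand: real-time factor is processing time divided by the duration of the audio processed, so anything below 1.0 is faster than real time. A small sketch, with a deliberately rough concurrency estimate that assumes streams are processed one after another on a single worker and that 80% utilisation is an acceptable ceiling (both assumptions, not NC-Opt behaviour):

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def max_concurrent_streams(rtf: float, headroom: float = 0.8) -> int:
    """Rough upper bound on real-time streams one serial worker can keep up with."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return int(headroom / rtf)


if __name__ == "__main__":
    rtf = real_time_factor(processing_s=0.005, audio_s=0.020)  # 5 ms per 20 ms frame
    print(rtf, max_concurrent_streams(rtf))
```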
Recording upload — the hangup handler
exten => h,1,System(/usr/local/bin/upload-recording.sh ${UNIQUEID} prod)
h is the hangup pseudo-extension; it runs after the call ends. The script reads the local WAV, gzips it, and uploads to S3 (or MinIO on-prem). The dashboard's call-detail view fetches it via signed URL when the engineer opens the call.
If a recording is missing, suspect three things: the channel never made it to a bridge (no MixMonitor target), the script returned non-zero (check syslog), or the bucket credentials expired.
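To reproduce the compress step by hand when checking a suspect recording, a Python stand-in like the one below is enough. It is not the contents of upload-recording.sh, which this document does not show; the function names and the object-key layout are hypothetical, and the upload itself is left out.

```python
import gzip
import shutil
from pathlib import Path


def gzip_recording(wav_path: Path) -> Path:
    """Write <name>.gz next to the WAV and return the new path."""
    gz_path = wav_path.with_name(wav_path.name + ".gz")
    with wav_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def object_key(unique_id: str, env: str) -> str:
    """Bucket key for a recording (layout is an assumption, not the real one)."""
    return f"{env}/{unique_id}-mixed.wav.gz"
```

If the gzip step succeeds locally on the same file, the remaining suspects are the script's exit status and the bucket credentials.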
Stasis dialplan apps and ARI events
If you need full programmatic control — record + replay + interrupt + transfer + DTMF capture, all driven from a single process — you write an ARI app:
exten => _X.,1,Stasis(freya-app)
The ARI WebSocket client (a separate HTTP service) receives StasisStart for that channel and dictates everything next: which sounds to play, when to bridge, when to dial out for transfer.
We do not currently use Stasis for production calls; the dialplan + chan_websocket pair is enough. Stasis is on the roadmap for transfer-to-human flows where we need the agent and a human leg in the same bridge mid-call.
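As a rough idea of what the client side of such an app looks like, the sketch below picks the channel id out of a StasisStart event received on the ARI WebSocket. The type, application and channel.id fields are part of ARI's event format; the function name and the decision to ignore everything else are illustrative choices, and the WebSocket connection itself is omitted.

```python
import json
from typing import Optional


def channel_from_stasis_start(raw_event: str, app_name: str = "freya-app") -> Optional[str]:
    """Return the channel id if this is a StasisStart for our app, else None."""
    event = json.loads(raw_event)
    if event.get("type") != "StasisStart":
        return None
    if event.get("application") != app_name:
        return None
    return event["channel"]["id"]


if __name__ == "__main__":
    sample = json.dumps({
        "type": "StasisStart",
        "application": "freya-app",
        "channel": {"id": "1700000000.42", "name": "PJSIP/providers-00000001"},
        "args": [],
    })
    print(channel_from_stasis_start(sample))
```

The returned id is what subsequent REST calls (answer, play, bridge) would be addressed to.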
Call lifecycle visible in logs
Trace a single call by Call-ID in freya-asterisk logs. The expected sequence is short and rigid — any deviation is your debugging entry point.
Missing line 3 → no ACK → call drops at ~60 s. Missing line 5 → WS handshake failed. Missing line 8 → check the hangup handler.
$ docker logs freya-asterisk 2>&1 | grep -F "<UNIQUEID>" | sort
A test call connects, the dashboard shows "in progress", but the user hears silence and the agent transcript shows nothing. Where does the audio path break: Asterisk inbound RTP, the WebSocket to the agent, or the agent's STT? How would you isolate which?
Walk the path in three checkable hops, each with one log signal:
- Asterisk inbound RTP. Run asterisk -rx "rtp set debug on" for the call. If you see no Got RTP packet from ... from the SBC's IP, RTP never arrived: firewall or NAT. The MixMonitor file will be silent on the read track.
- WebSocket to the agent. In freya-asterisk logs, look for Channel WebSocket/ai_media-... joined 'simple_bridge'. If absent, the WS handshake failed (wrong host, wrong port, agent down). If present but audio is silent, check the agent side: docker logs voice-agent 2>&1 | grep media. No media frames means Asterisk isn't pushing audio over the WS even though the channel is up (transcoder issue or codec mismatch).
- Agent STT. If the agent is seeing media frames but the transcript is empty, STT itself is the problem: NC-Opt unhealthy, Whisper service down, or the audio is being decoded as the wrong codec. Check nc-opt and STT container logs.
The fastest single command: tcpdump -ni any 'udp portrange 10000-20000' on the Asterisk host — if you see RTP both ways, the problem is past Asterisk; if only one direction, the problem is in front of Asterisk.
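The three checks form a simple decision ladder: the first signal that is missing names the broken hop. A sketch of that logic, with hypothetical labels and boolean inputs standing in for what you observed in the logs:

```python
def first_broken_hop(rtp_from_sbc: bool, ws_joined_bridge: bool, agent_sees_media: bool) -> str:
    """Return the earliest hop whose log signal is missing."""
    if not rtp_from_sbc:
        return "asterisk-inbound-rtp"      # no RTP from the SBC: firewall or NAT
    if not ws_joined_bridge:
        return "websocket-handshake"       # WS channel never joined the bridge
    if not agent_sees_media:
        return "asterisk-to-agent-media"   # channel up, but no media frames at the agent
    return "agent-stt"                     # audio reaches the agent; look at NC-Opt / STT


if __name__ == "__main__":
    print(first_broken_hop(rtp_from_sbc=True, ws_joined_bridge=True, agent_sees_media=False))
```

Checking in this order matters because each later signal is only meaningful once the earlier one is present.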