SIP Telephony — Zero to Hero v1 · 2026-04-25
PART 05 · STEPS 61–75

Bridge: Asterisk ↔ Freya Voice Agent

The integration core: a single WebSocket carries µ-law audio between Asterisk's chan_websocket channel driver and the pipecat-agent pipeline. Once you can read this connection, every inbound and outbound Freya call becomes legible.

15 steps · 3 demos · ~22 minutes
61 · /telephony/ws

The /telephony/ws endpoint

The Freya voice agent (pipecat-agent) exposes ws://<host>:7860/telephony/ws. Asterisk's chan_websocket connects here for every call. Inside the agent, this socket is plugged directly into the Pipecat pipeline.

  • Accepts a WebSocket from Asterisk's chan_websocket.
  • Receives JSON-framed audio chunks from the caller side.
  • Pushes JSON-framed TTS audio back to Asterisk.
Architecture — voice-agent /telephony/ws
freya-asterisk (chan_websocket, network_mode: host)
   |  Dial(WebSocket/ai_media)
   v
ws://host:7860/telephony/ws
   |
   v
pipecat-agent · :7860
   WS in → STT → LLM → TTS → WS out
   (caller voice → text → reply → caller voice; Pipecat frame pipeline, full-duplex)
A single WebSocket per call, both directions multiplexed in JSON envelopes.
62 · chan_websocket

chan_websocket — what it really is

It is a Sangoma-maintained Asterisk module that makes a WebSocket destination look like any other channel to the dialplan. From Dial()'s perspective it's the same as PJSIP, IAX2, or Local — different transport, same channel surface.

  • One channel = one WebSocket connection.
  • Can be the destination of Dial(WebSocket/...).
  • Can be Originate-d for outbound flows.
  • Carries audio frames as binary or wrapped in JSON envelopes (configurable).

In our config we use JSON frames (f(json) in the Dial string), which makes parsing trivial on the agent side and lets us multiplex control messages alongside media.

63 · µ-law frames

G.711 µ-law frame format

Audio leaves Asterisk in 20 ms chunks of µ-law (8 kHz, 1 byte per sample = 160 bytes per chunk, 50 chunks/s). The agent sees:

{"type": "media", "audio": "<base64-160-bytes>", "ts": 12345}

Sizes worth memorising:

  • 20 ms × 8000 samples/s × 1 byte/sample = 160 bytes per frame.
  • 1 byte per sample because µ-law is logarithmically compressed 16-bit PCM.
  • Base64 expansion: 160 raw bytes → exactly 216 base64 characters (4 × ⌈160 / 3⌉), before the JSON envelope adds its own overhead.
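
The sizes above are easy to verify. A minimal sketch, not the agent's actual code: the helper names are hypothetical, and the envelope fields are copied from the example message shown earlier.

```python
import base64
import json


def frame_bytes(frame_ms: int = 20, sample_rate: int = 8000, bytes_per_sample: int = 1) -> int:
    """Raw payload size of one audio frame, in bytes."""
    return frame_ms * sample_rate * bytes_per_sample // 1000


def base64_len(raw_len: int) -> int:
    """Length of the padded base64 encoding of raw_len bytes."""
    return 4 * ((raw_len + 2) // 3)


def build_media_envelope(payload: bytes, ts: int) -> str:
    """Wrap one µ-law frame in a JSON media envelope (hypothetical helper)."""
    audio = base64.b64encode(payload).decode("ascii")
    return json.dumps({"type": "media", "audio": audio, "ts": ts})


silence = bytes([0xFF]) * frame_bytes()   # 0xFF decodes to zero amplitude in µ-law
envelope = build_media_envelope(silence, ts=12345)
```

The whole envelope is longer than 216 characters because the field names, quotes and ts value travel with every frame.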
64 · JSON control

JSON control messages

Alongside media frames, the protocol multiplexes control envelopes. Knowing them by sight is half the debugging battle.

type    direction          meaning
start   Asterisk → agent   Call begins. Carries call-id, codec, direction.
media   both               20 ms µ-law audio chunk, base64-encoded.
dtmf    Asterisk → agent   DTMF digit. Rare: we mostly carry DTMF in-band.
clear   agent → Asterisk   Flush buffered TTS audio (interruption).
stop    both               Call ended.
Demo · WebSocket frame stream
A single call's wire trace, simplified. Blue = media, purple = control, red = end.
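
One way to put the message types to work when debugging is to classify every text frame the agent receives and flag anything that should not arrive from the Asterisk side. This is a sketch under assumptions: the function name and the returned labels are hypothetical, and only the type values and directions come from the list of control messages.

```python
import json

# Types the agent may legitimately receive from Asterisk.
ASTERISK_TO_AGENT = {"start", "media", "dtmf", "stop"}


def classify(raw: str) -> str:
    """Return a coarse label for one inbound text frame (hypothetical helper)."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return "drop:not-json"
    if not isinstance(msg, dict):
        return "drop:not-object"
    kind = msg.get("type")
    if kind not in ASTERISK_TO_AGENT:
        # e.g. "clear" only ever flows agent → Asterisk
        return f"drop:unexpected-type:{kind}"
    if kind == "media" and "audio" not in msg:
        return "drop:media-without-audio"
    return kind
```

Counting the labels over one call gives a quick sanity check: one start, roughly 50 media per second, one stop.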
65 · direction

Audio direction

Bidirectional, full-duplex on the same WebSocket.

  • Asterisk → agent: caller's voice. Goes to STT.
  • Agent → Asterisk: TTS output. Routed back to the caller via RTP.

Both directions share frame numbering (ts), so the agent can interrupt itself cleanly by sending a clear message when the user starts talking over the bot.
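
The interruption logic can be sketched as a queue of unsent TTS frames that is emptied on barge-in. The class name is hypothetical, and the exact shape of the clear message (a bare type field) is an assumption; only the type name itself comes from this document.

```python
import json
from collections import deque


class OutboundAudio:
    """TTS frames waiting to be written to the WebSocket (hypothetical class)."""

    def __init__(self) -> None:
        self.pending = deque()

    def enqueue(self, envelope: str) -> None:
        """Queue one media envelope for sending."""
        self.pending.append(envelope)

    def barge_in(self) -> str:
        """Drop unsent TTS frames and return the clear message to send next."""
        self.pending.clear()
        return json.dumps({"type": "clear"})
```

Clearing the local queue stops new audio from leaving the agent; the clear message asks Asterisk to flush what it has already buffered.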

66 · c(ulaw)f(json)

The c(ulaw)f(json) codec/format spec

Inside Dial(WebSocket/ai_media/c(ulaw)f(json)) the parenthesised options pin both the codec and the framing.

  • c(ulaw) — Asterisk transcodes any incoming codec to µ-law before sending over WS, and accepts µ-law back.
  • f(json) — frames are wrapped in JSON envelopes, not raw binary.

If the trunk side is alaw, Asterisk transcodes once on entry and once on exit. Cheap on modern CPUs but it counts in latency budgets.
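
When auditing a dialplan it can help to pull these options out of the dial target programmatically. A small sketch, assuming every option has the letter(value) shape used in this document; the function name is hypothetical and this is not how Asterisk itself parses the string.

```python
import re


def parse_ws_options(options: str) -> dict:
    """Split a string like 'c(ulaw)f(json)' into {'c': 'ulaw', 'f': 'json'}."""
    return dict(re.findall(r"([a-z])\(([^)]*)\)", options))


dial_target = "WebSocket/ai_media/c(ulaw)f(json)"
tech, connection, options = dial_target.split("/", 2)
parsed = parse_ws_options(options)
```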

Diagram · Envelope around 160 µ-law bytes
f(json) — outer wrapper
{ "type": "media", "audio": "<...>", "ts": 12345 }
audio field — base64 of payload
"ZnJleWE6IG11LWxhdyBmcmFtZSAtLSAxNjAgYnl0ZXMgaW5zaWRlIQ=="
c(ulaw) — 160 bytes, 20 ms @ 8 kHz
[ 0xFF 0x7F 0x80 ... 160 bytes total ... 0x80 0x7E 0xFF ]
Three nested layers. Strip the outer JSON, base64-decode the audio, you get pure G.711 µ-law.
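
Peeling the three layers takes only the standard library. The sketch below decodes a media envelope down to signed 16-bit linear PCM using the standard G.711 µ-law expansion; the function names are hypothetical, and the envelope layout is the one shown in the diagram.

```python
import base64
import json


def ulaw_to_linear(byte: int) -> int:
    """Decode one G.711 µ-law byte to a signed 16-bit linear PCM sample."""
    u = ~byte & 0xFF                      # µ-law bytes are transmitted bit-inverted
    exponent = (u >> 4) & 0x07            # 3-bit segment number
    mantissa = u & 0x0F                   # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if u & 0x80 else magnitude


def unwrap(envelope: str) -> list:
    """JSON → base64 → µ-law bytes → linear PCM samples."""
    msg = json.loads(envelope)
    ulaw = base64.b64decode(msg["audio"])
    return [ulaw_to_linear(b) for b in ulaw]
```

A frame that decodes to all zeros is digital silence, which is a useful thing to check when a call connects but nobody hears anything.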
67 · MixMonitor

Call recording with MixMonitor

exten => _X.,n,MixMonitor(${UNIQUEID}-mixed.wav,b)

MixMonitor taps the audio bridge and writes a WAV containing both directions. Variants:

  • MixMonitor(file.wav) — mixed only.
  • MixMonitor(file.wav,r(read.wav)t(write.wav)) — separate caller/agent tracks.

Our deployments record the mix and upload via a hangup handler (upload-recording.sh). The dashboard exposes per-track downloads via ?track=user|assistant when separate tracks are available.

68 · inbound flow

Inbound call flow end-to-end

One frame's journey, from PSTN handset to LLM and back. Each arrow is a place a packet can get lost — RTP, transcode, WebSocket, STT buffer, TTS pacing, RTP again.

Demo · End-to-end inbound flow
[Interactive diagram: PSTN handset (caller) → Customer SBC (SIP+RTP) → freya-asterisk (chan_websocket) → /telephony/ws (JSON µ-law) → pipecat-agent (pipeline router) → STT (Whisper, audio → text) → LLM (text → text) → TTS (Spark, text → audio) → RTP back to caller.]
Caller direction (left → right) goes through STT/LLM/TTS. Agent direction (right → left) is just synthesized audio replayed back to the caller.
PSTN/Customer SBC
   |  SIP INVITE  (UDP/TCP, port 5060)
   v
freya-asterisk   // pjsip.conf identifies caller by IP, routes to from-trunk
   |  Dial(WebSocket/ai_media/c(ulaw)f(json))
   v
pipecat-agent (port 7860 /telephony/ws)
   |  audio frames → Whisper STT
   |  text → LLM → TTS (Spark)
   |  audio frames back to Asterisk
   v
freya-asterisk → RTP → SBC → PSTN
69 · outbound flow

Outbound call flow

Outbound is initiated by the campaign-worker over ARI. Asterisk talks SIP to the trunk; once the carrier returns 200 OK, the channel goes Up and is bridged to a fresh WebSocket leg into pipecat-agent.

Demo · Outbound — ARI control + SIP signaling
ARI (HTTP/WS)
SIP signalling
Two synchronised ladders: ARI carries control + channel events; SIP carries the actual call setup to the carrier.

The KKB campaign we analysed lives in this exact path. The 603 Decline came from trunk → PSTN (carrier-side rejection), so the SIP ladder fails before the ARI ladder ever sees a ChannelStateChange: Up.

70 · ARI vs AMI

Originate via ARI vs AMI

Two ways to ask Asterisk to place a call from outside.

Interface   Transport    Style                When we use it
ARI         HTTP + WS    JSON, RESTful        Default. campaign-worker is HTTP-native; trivial to call from any service.
AMI         TCP / 5038   Line-based actions   Legacy integrations. Same outcome, older protocol.

Sample ARI Originate body:

POST /ari/channels
{
  "endpoint":  "PJSIP/providers/sip:+905374705251@93.180.132.170",
  "extension": "s",
  "context":   "from-trunk-outbound",
  "callerId":  "908502427127",
  "variables": { "X-Freya-Direction": "outbound",
                 "X-Freya-Call-Id":   "..." }
}
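
From a service such as campaign-worker, the same request could be assembled with nothing but the standard library. This is a sketch, not the worker's real code: the function name, base URL and credentials are placeholders (8088 is Asterisk's default HTTP port; your http.conf may differ), and the request is built but deliberately not sent.

```python
import base64
import json
import urllib.request


def build_originate(base_url: str, user: str, password: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /ari/channels request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/ari/channels",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",   # ARI accepts HTTP basic auth
        },
    )


body = {
    "endpoint": "PJSIP/providers/sip:+905374705251@93.180.132.170",
    "extension": "s",
    "context": "from-trunk-outbound",
    "callerId": "908502427127",
    "variables": {"X-Freya-Direction": "outbound",
                  "X-Freya-Call-Id": "..."},      # elided in the sample body
}
# Placeholder host and credentials, for illustration only.
req = build_originate("http://127.0.0.1:8088", "ari-user", "ari-pass", body)
# urllib.request.urlopen(req) would actually place the call.
```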
71 · where it runs

Where pipecat-agent runs

In our docker-compose, the agent is host-networked alongside Asterisk:

voice-agent:
  image: ...freya-onprem/voice-agent:latest
  network_mode: host         # uses host's network stack directly
  ports:
    - 7860                   # listening port, informational: Docker does not apply port mappings in host mode
  ...

network_mode: host is required because the agent needs to talk RTP to coturn and the SBC without docker-bridge NAT in the way. Asterisk runs the same way for the same reason — anything that touches RTP must avoid double-NATting.

72 · NC-Opt

NC-Opt — the noise cancellation service

A separate WebSocket service at ws://nc-service:8005. The agent can stream raw mic audio to NC-Opt, get denoised audio back, then forward the cleaned-up frames to STT.

  • Optional but improves accuracy on noisy lines (mobile callers, car cabins).
  • GPU service, one GPU dedicated.
  • Configurable RTF (real-time factor) and concurrency.
Try this on KKB
$ docker logs -f nc-opt 2>&1 | grep -E 'rtf|frames|err'
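
When reading the rtf values in those logs, it helps to have the arithmetic at hand. The sketch assumes the common definition of real-time factor (processing time divided by audio duration, so values below 1 are faster than real time) and a worker that processes streams one after another; NC-Opt's own definition and scheduling may differ, and both function names are hypothetical.

```python
import math


def real_time_factor(processing_ms: float, audio_ms: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_ms <= 0:
        raise ValueError("audio duration must be positive")
    return processing_ms / audio_ms


def max_realtime_streams(rtf: float) -> int:
    """Upper bound on streams a single serial worker can keep up with."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return math.floor(1 / rtf)
```

For example, denoising each 20 ms frame in 5 ms gives an RTF of 0.25, which leaves room for at most four concurrent calls on that worker before audio starts to lag.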
73 · recording upload

Recording upload — the hangup handler

exten => h,1,System(/usr/local/bin/upload-recording.sh ${UNIQUEID} prod)

h is the hangup pseudo-extension; it runs after the call ends. The script reads the local WAV, gzips it, and uploads to S3 (or MinIO on-prem). The dashboard's call-detail view fetches it via signed URL when the engineer opens the call.

If a recording is missing, suspect three things: the channel never made it to a bridge (no MixMonitor target), the script returned non-zero (check syslog), or the bucket credentials expired.
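
To reproduce the compression step by hand while debugging a missing recording, something like the following is enough. It is a Python rendering of what the script is described as doing, not the script itself (upload-recording.sh is a shell script whose contents are not shown here); the function name and the .wav.gz naming are assumptions, and the upload step is left out.

```python
import gzip
import shutil
from pathlib import Path


def gzip_recording(wav_path: Path) -> Path:
    """Write <name>.wav.gz next to the WAV and return the new path."""
    gz_path = wav_path.with_name(wav_path.name + ".gz")
    with wav_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)      # stream the file, no full read into memory
    return gz_path
```

If this fails with FileNotFoundError for the call's UNIQUEID, the WAV was never written, which points back at MixMonitor rather than at the upload.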

74 · Stasis

Stasis dialplan apps and ARI events

If you need full programmatic control — record + replay + interrupt + transfer + DTMF capture, all driven from a single process — you write an ARI app:

exten => _X.,1,Stasis(freya-app)

The ARI WebSocket client (a separate HTTP service) receives StasisStart for that channel and dictates everything next: which sounds to play, when to bridge, when to dial out for transfer.

We do not currently use Stasis for production calls; the dialplan + chan_websocket pair is enough. Stasis is on the roadmap for transfer-to-human flows where we need the agent and a human leg in the same bridge mid-call.
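
For a feel of what such an ARI app looks like, here is a sketch of the decision step only: it takes one event as received on the ARI WebSocket and returns the REST calls a controller might issue next. It is illustrative rather than production code; the function name and the choice of actions (answer, then play a stock prompt) are assumptions, while the event fields and the two channel endpoints are standard ARI.

```python
import json


def plan_actions(event_text: str, app_name: str = "freya-app") -> list:
    """Map one ARI event to a list of (method, path) REST calls."""
    event = json.loads(event_text)
    if event.get("application") != app_name:
        return []                         # event belongs to another Stasis app
    if event.get("type") == "StasisStart":
        channel_id = event["channel"]["id"]
        return [
            ("POST", f"/ari/channels/{channel_id}/answer"),
            ("POST", f"/ari/channels/{channel_id}/play?media=sound:hello-world"),
        ]
    return []                             # other events are ignored in this sketch
```

Keeping the decision logic free of network code makes it easy to replay captured events against it.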

75 · lifecycle in logs

Call lifecycle visible in logs

Trace a single call by Call-ID in freya-asterisk logs. The expected sequence is short and rigid — any deviation is your debugging entry point.

#   log line                                                  meaning
1   SIP/2.0 100 Trying                                        we send to peer
2   SIP/2.0 200 OK                                            we receive: call answered
3   ACK ...                                                   we send: completes 3-way handshake
4   Channel PJSIP/... joined 'simple_bridge'                  caller leg in bridge
5   Channel WebSocket/ai_media-... joined 'simple_bridge'     agent leg in bridge, audio flows
6   Channel WebSocket/ai_media-... left 'simple_bridge'       agent leg removed (BYE)
7   Channel PJSIP/... left 'simple_bridge'                    caller leg removed
8   End MixMonitor Recording PJSIP/...                        recording closed and ready to upload

Missing line 3 → no ACK → call drops at ~60 s. Missing line 5 → WS handshake failed. Missing line 8 → check the hangup handler.

Try this on KKB
$ docker logs freya-asterisk 2>&1 | grep -F "<UNIQUEID>" | sort
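
The grep output can be checked against the expected sequence mechanically. A minimal sketch, assuming each log line contains the marker text from the eight-step list verbatim; the patterns and the function name are illustrative and may need adjusting to your log format.

```python
import re

# (step number, pattern) pairs in the expected order.
EXPECTED = [
    (1, r"SIP/2\.0 100 Trying"),
    (2, r"SIP/2\.0 200 OK"),
    (3, r"\bACK\b"),
    (4, r"Channel PJSIP/\S+ joined 'simple_bridge'"),
    (5, r"Channel WebSocket/ai_media-\S+ joined 'simple_bridge'"),
    (6, r"Channel WebSocket/ai_media-\S+ left 'simple_bridge'"),
    (7, r"Channel PJSIP/\S+ left 'simple_bridge'"),
    (8, r"End MixMonitor Recording PJSIP/"),
]


def missing_steps(lines: list) -> list:
    """Return the step numbers whose marker appears in none of the lines."""
    return [
        step
        for step, pattern in EXPECTED
        if not any(re.search(pattern, line) for line in lines)
    ]
```

The lowest missing number is the place to start: a result of [5, 6], for instance, says the agent leg never joined the bridge.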
Checkpoint 5

A test call connects, the dashboard shows "in progress", but the user hears silence and the agent transcript shows nothing. Where does the audio path break: Asterisk inbound RTP, the WebSocket to the agent, or the agent's STT? How would you isolate which?

Answer

Walk the path in three checkable hops, each with one log signal:

  1. Asterisk inbound RTP. Run asterisk -rx "rtp set debug on" for the call. If you see no Got RTP packet from ... from the SBC's IP, RTP never arrived — firewall or NAT. The MixMonitor file will be silent on the read track.
  2. WebSocket to the agent. In freya-asterisk logs, look for Channel WebSocket/ai_media-... joined 'simple_bridge'. If absent, the WS handshake failed (wrong host, wrong port, agent down). If present but audio is silent, check the agent side: docker logs voice-agent | grep media — no media frames means Asterisk isn't pushing audio over the WS even though the channel is up (transcoder issue or codec mismatch).
  3. Agent STT. If the agent is seeing media frames but transcript is empty, STT itself is the problem: NC-Opt unhealthy, Whisper service down, or the audio is being decoded as the wrong codec. Check nc-opt and STT container logs.

The fastest single command: tcpdump -ni any 'udp portrange 10000-20000' on the Asterisk host — if you see RTP both ways, the problem is past Asterisk; if only one direction, the problem is in front of Asterisk.
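
The three hops reduce to a short decision function, which is handy as a runbook reminder. The function name, parameter names and returned labels are hypothetical; each boolean corresponds to one of the log signals described in the answer.

```python
def first_broken_hop(rtp_from_sbc: bool, ws_leg_joined: bool, agent_sees_media: bool) -> str:
    """Return the first hop in the audio path that fails its check."""
    if not rtp_from_sbc:
        return "asterisk-inbound-rtp"      # no 'Got RTP packet from' the SBC
    if not ws_leg_joined:
        return "websocket-handshake"       # WebSocket leg never joined the bridge
    if not agent_sees_media:
        return "asterisk-to-agent-media"   # channel up, but no media frames arrive
    return "agent-stt"                     # audio arrives; NC-Opt or STT is suspect
```

The order of the checks matters: each one is only meaningful once the hop before it has been confirmed healthy.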