SIP Train Blog - Troubleshooting
The Top 10 SIP Failure Modes and the Fingerprint Each Leaves in a Trace
Every senior VoIP engineer eventually develops the same diagnostic instinct: open the trace, glance at the ladder, and know within thirty seconds which of a small handful of failure modes you are dealing with. The trace tells you. The numbers on the responses, the timing of the retransmissions, and the few headers that are present, missing, or wrong narrow the diagnosis to one of about ten patterns long before you have read a single SDP body.
This post is the field guide. Ten failure modes that account for the overwhelming majority of "it doesn't work" tickets in production SIP, the precise trace fingerprint each leaves, and the RFC clause that explains the behaviour. The list is opinionated (there are more than ten failure modes in SIP) but these are the ten that we see again and again across carriers, hosted PBX operators, and enterprise voice platforms. If you can recognise these by sight, you have eliminated about 80% of the noise in the average post-mortem.
1. DNS NAPTR/SRV resolution failure
Symptom. The user clicks call. Nothing visible happens for ten to thirty seconds. Then a SIP 503 Service Unavailable or a hard timeout. The UAC's call counter never increments on the proxy.
Fingerprint. No INVITE on the wire from the UAC at all, or an INVITE sent to the wrong address family. A packet capture wider than SIP shows a NAPTR query for the destination domain, no answer, an SRV fallback, no answer, and an A/AAAA fallback that succeeds, often pointing at a stale or wrong host. The total elapsed time before any SIP traffic appears is five to fifteen seconds, dominated by DNS resolver retries.
Why. RFC 3263 defines a strict ordering for SIP server resolution: NAPTR first to discover transports, SRV second to discover the per-transport host/port pairs, and A/AAAA only as a last resort. When the NAPTR or SRV records are missing or stale, every UAC has to time-out the higher tiers before falling through to the next, and each timeout is on the order of seconds.
What to look for. Capture on UDP/53 alongside the SIP capture. If the SIP capture starts more than a couple of seconds after the user-initiated event and the destination is a fully-qualified domain name rather than an IP literal, NAPTR/SRV is the suspect. The fix is almost always provisioning, not protocol, but an engineer who tries to diagnose it from a SIP-only capture is going to chase ghosts for hours.
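To make the fall-through order concrete, here is a minimal Python sketch of the query sequence a UAC works through when every tier comes back empty. The function names, the order of the three SRV labels, and the two-second threshold are illustrative assumptions, not anything mandated by RFC 3263, and the sketch ignores the case where a NAPTR answer supplies the SRV names directly.

```python
import ipaddress

# SRV labels RFC 3263 names for SIP; the order tried here is an assumption.
SRV_PREFIXES = ("_sips._tcp", "_sip._tcp", "_sip._udp")


def is_ip_literal(target: str) -> bool:
    """True if the target is an IPv4/IPv6 literal (brackets allowed for IPv6)."""
    try:
        ipaddress.ip_address(target.strip("[]"))
        return True
    except ValueError:
        return False


def resolution_plan(target: str) -> list:
    """Ordered (record type, name) queries when each tier returns no answer."""
    if is_ip_literal(target):
        return []  # an IP literal needs no DNS lookups at all
    plan = [("NAPTR", target)]
    plan += [("SRV", f"{prefix}.{target}") for prefix in SRV_PREFIXES]
    plan += [("A", target), ("AAAA", target)]
    return plan


def dns_suspect(user_event_ts: float, first_sip_ts: float, target: str,
                threshold_s: float = 2.0) -> bool:
    """Flag DNS as the suspect when SIP starts late and the target is a name."""
    delay = first_sip_ts - user_event_ts
    return delay > threshold_s and not is_ip_literal(target)
```

Each entry in the plan is a query that may have to time out before the next one is tried, which is where the multi-second silence before the first INVITE comes from.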
2. INVITE retransmission storm
Symptom. The call eventually fails, but the trace looks busy. Every INVITE is repeated multiple times, and so is every other request method.
Fingerprint. Identical INVITEs (same branch parameter, same Call-ID, same CSeq) retransmitted at 0.5s, 1.5s, 3.5s, 7.5s, 15.5s, and 31.5s after the original, the gap doubling each time, with the transaction abandoned at the 32-second mark. No 1xx response from the downstream element. The signature is the geometric backoff defined for the INVITE Client transaction in RFC 3261 §17.1.1.2: Timer A starts at T1 (default 500ms) and doubles on each retransmission until Timer B (64×T1 = 32s) terminates the transaction. Note that Timer A does not cap at T2: the T2 cap applies to the non-INVITE Client (Timer E) and the INVITE Server (Timer G), not to INVITE Client retransmissions.
Why. The UAC has not received a provisional response, so it is following the standard transaction-layer retransmission timer. Either the next-hop element has not received the INVITE at all (transport-level: routing, firewall, NAT pinhole) or it has received it and is failing to send 100 Trying within T1.
What to look for. Compare the inter-retransmit gaps to the canonical Timer A doubling sequence: 0.5s, 1s, 2s, 4s, 8s, 16s. If the timing matches, the retransmission is correct UAC behaviour reacting to silence from the next hop. The fault lies downstream of the UAC, not in the UAC. A packet capture on the next-hop element will show whether the INVITE arrived. If it did, the next-hop element is failing to send 100 Trying. RFC 3261 §17.2.1 requires an INVITE server transaction to generate 100 Trying unless it knows the TU will respond within 200ms, so the absence of any 1xx from a stateful proxy is itself a sign of a misbehaving proxy or a stateless one masquerading as stateful.
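The timing comparison is easy to automate. The sketch below derives the Timer A schedule from T1 and checks a list of capture timestamps against it. The function names and the 100ms tolerance are assumptions chosen for illustration.

```python
def timer_a_offsets(t1: float = 0.5) -> list:
    """Seconds after the original INVITE at which Timer A retransmits.

    The interval starts at T1 and doubles with no T2 cap; Timer B (64*T1)
    ends the transaction.
    """
    offsets, elapsed, interval = [], 0.0, t1
    timer_b = 64 * t1
    while elapsed + interval < timer_b:
        elapsed += interval
        offsets.append(elapsed)
        interval *= 2
    return offsets


def matches_timer_a(timestamps: list, t1: float = 0.5,
                    tolerance: float = 0.1) -> bool:
    """timestamps: capture time of the original INVITE, then each retransmit."""
    if len(timestamps) < 2:
        return False
    observed = [t - timestamps[0] for t in timestamps[1:]]
    expected = timer_a_offsets(t1)
    if len(observed) > len(expected):
        return False  # more retransmits than Timer B allows
    return all(abs(o - e) <= tolerance for o, e in zip(observed, expected))
```

With the default T1, `timer_a_offsets()` yields six retransmissions, the last at 31.5s, half a second before Timer B fires. A trace that passes `matches_timer_a` points at the next hop; one that fails suggests a non-standard T1 or an application-level retry rather than the transaction layer.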
3. The 401/407 authentication challenge loop
Symptom. Registration or call setup never completes. The trace shows a polite back-and-forth between client and server, never resolving.
Fingerprint. The UAC sends REGISTER (or INVITE) without credentials.
The server returns 401 Unauthorized (registrar) or 407 Proxy Authentication Required
(proxy) with a WWW-Authenticate or Proxy-Authenticate
header carrying a fresh nonce. The UAC responds with credentials. The
server returns another 401/407 with another fresh nonce. The
UAC tries again. And again. The CSeq increments on every attempt; the Call-ID in
REGISTER stays the same.
Why. RFC 3261 §22 and the underlying digest scheme (RFC 7616, formerly RFC 2617) compute the response hash from the username, password, realm, nonce, URI, and method, plus the cnonce and nc values when qop=auth is in use. If any of those inputs is wrong on the client side, most commonly the realm or the password, every retry produces a hash the server cannot verify, and the server reissues a fresh challenge. The clue is that the second challenge's nonce differs from the first.
What to look for. Look at the realm in the challenge and the realm the UAC's digest response was computed against. Mismatches happen routinely when an SBC inserts itself as the authenticating element and the UAC has been provisioned against the upstream registrar's realm. Look also at the algorithm (MD5, MD5-sess, SHA-256), qop, and opaque parameters: a UAC that mishandles qop=auth will fail every challenge silently. Two consecutive 401/407 responses with different nonces, and no stale=true flag on the second, are the unmistakable signal that the credentials, not the connectivity, are at fault.
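When you have the provisioned credentials to hand, you can recompute the digest yourself and compare it against the response= value in the trace. The following is a minimal sketch covering only the MD5 algorithm with either no qop or qop=auth. The function name and parameter names are illustrative, and MD5-sess, SHA-256, and auth-int are deliberately left out.

```python
import hashlib


def _md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username: str, realm: str, password: str,
                    method: str, uri: str, nonce: str,
                    qop: str = None, nc: str = None,
                    cnonce: str = None) -> str:
    """Compute the Digest response value (algorithm=MD5 only)."""
    ha1 = _md5_hex(f"{username}:{realm}:{password}")
    ha2 = _md5_hex(f"{method}:{uri}")
    if qop == "auth":
        # qop=auth mixes the nonce count and client nonce into the hash
        return _md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")
    # Legacy form without qop
    return _md5_hex(f"{ha1}:{nonce}:{ha2}")
```

Recompute once with the realm from the challenge and once with the realm the device was provisioned against. If only the second matches what the UAC sent, the realm mismatch is confirmed without touching the device.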
4. 488 Not Acceptable Here: codec mismatch
Symptom. The call is rejected immediately. The user hears a busy or "call cannot be completed" indication.
Fingerprint. The UAS responds to INVITE with
488 Not Acceptable Here. There is no 180 Ringing, no 200 OK. The UAC's
outbound SDP offer carries one set of m=audio codecs (e.g.
RTP/AVP 9 0 8) and the answerer rejects the entire session.
Why. RFC 3261 §21.4.26 reserves 488 for the case where the answerer
has examined the offered SDP and finds no acceptable subset. RFC 3264 §6 then
defines a finer-grained per-stream rejection mechanism: a stream the answerer wishes
to decline is signalled by setting its m= line port to zero (e.g.
m=audio 0 RTP/AVP 0) in an answer where the rest of the session is
acceptable. So 488 means no part of the offer was usable; port-zero on a
single m= line is the right tool for "this one stream is unusable but
the rest of the session is fine".
What to look for. Read the offered SDP and compare it against what the answerer's policy permits; the cause is almost always a codec configuration mismatch. The common forensic trap is that the answerer in question is not the original UAS but an SBC enforcing a codec policy on behalf of the core. If the trace shows the SBC sourcing the 488, examine the SBC's codec filter; if the trace shows the UAS sourcing it, the offer contained no codec the UAS itself would accept.
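The comparison between the offer and the answerer's policy can be sketched in a few lines of Python. The function names and the idea of expressing the policy as a set of static payload type numbers are assumptions for illustration; dynamic payload types (96 and above) would need their a=rtpmap lines compared instead, which this sketch does not attempt.

```python
def audio_payload_types(sdp: str) -> tuple:
    """Return (port, [payload types]) from the first m=audio line."""
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m=audio "):
            # Layout: m=audio <port>[/<count>] <proto> <fmt> <fmt> ...
            fields = line.split()
            port = int(fields[1].split("/")[0])
            return port, [int(pt) for pt in fields[3:]]
    raise ValueError("no m=audio line in SDP")


def acceptable_codecs(offer_sdp: str, allowed: set) -> list:
    """Offered payload types the policy permits, in the offerer's order."""
    _, offered = audio_payload_types(offer_sdp)
    return [pt for pt in offered if pt in allowed]


def stream_declined(answer_sdp: str) -> bool:
    """True if the answer rejects the audio stream with port zero."""
    port, _ = audio_payload_types(answer_sdp)
    return port == 0
```

An empty list from `acceptable_codecs` is the condition under which a 488 is the expected outcome; a non-empty list alongside a 488 means the rejection came from something other than the codec set.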
5. 423 Interval Too Brief
Symptom. A device's REGISTER fails immediately. The phone shows "registration failed" or "service unavailable". The user retries, with the same result.
Fingerprint. The UAC sends REGISTER with Expires: 60
(or some small value). The registrar returns 423 Interval Too Brief
with a Min-Expires: 3600 header. The UAC, depending on its
sophistication, either gives up or correctly retries with Expires: 3600
and succeeds.
Why. RFC 3261 §10.3 step 7 lets a registrar reject a REGISTER whose
requested expiration is below its minimum policy threshold and return the threshold
via the Min-Expires header field (defined in §20.23). A
standards-compliant UAC adapts and retries; a poorly-implemented UAC reads the 4xx
as a hard fail.
What to look for. A single 423 in a registration trace is almost
never a fault: it is the registrar negotiating up. The fault is when the UAC does
not retry with the supplied Min-Expires value. If the trace shows the
UAC retrying its original Expires: 60 and getting a second 423, the UAC
is broken. Devices that misbehave here are often older hardware phones with
hardcoded refresh intervals or test scripts that ignore response headers.
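The distinction between a UAC that adapts and one that is broken can be expressed as a small classifier over the REGISTER attempts seen in a trace. This is a sketch only: the tuple layout, function name, and verdict strings are hypothetical.

```python
from typing import Optional


def classify_423_handling(attempts: list) -> str:
    """Classify UAC behaviour after a 423.

    attempts: list of (expires_sent, status_code, min_expires_or_None),
    in the order the REGISTER transactions appear in the trace.
    """
    for index, (_, status, min_expires) in enumerate(attempts):
        if status != 423 or min_expires is None:
            continue
        if index + 1 >= len(attempts):
            return "gave-up"  # no retry followed the 423
        next_expires = attempts[index + 1][0]
        if next_expires >= min_expires:
            return "adapted"  # retried at or above the registrar's minimum
        return "broken"  # retried below Min-Expires
    return "no-423"


# Example: the healthy negotiation described above.
healthy: list = [(60, 423, 3600), (3600, 200, None)]
verdict: Optional[str] = classify_423_handling(healthy)
```

Only the "broken" verdict points at a device fault. "gave-up" is usually a UAC treating the 4xx as terminal, which is poor behaviour but a different bug.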
6. 483 Too Many Hops: the routing loop
Symptom. The call fails after a brief pause. The trace looks short and undramatic. The originating UAC sees a 483.
Fingerprint. Every INVITE leaving the UAC carries Max-Forwards: 70 (the default per RFC 3261 §8.1.1.6). On the way through, every proxy decrements Max-Forwards by one. If a routing loop exists, the Max-Forwards count reaches zero and the proxy that receives it returns 483 Too Many Hops per RFC 3261 §21.4.21 and §16.3 step 3. The trace from any single proxy in the loop will show INVITEs arriving at it repeatedly with successively lower Max-Forwards values: same Call-ID, same From-tag, decrementing Max-Forwards.
Why. Routing loops happen when two proxies have inconsistent routing tables, or when a Route header set is malformed, or when a redirect (3xx) is incorrectly handled and re-injected.
What to look for. Capture on each proxy in the suspected path. If the same Call-ID arrives at the same proxy more than once with different Max-Forwards values, you have your loop, and the number of decrements between two consecutive arrivals tells you how many hops the loop spans.
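Given INVITE arrivals extracted from the per-proxy captures, the loop check reduces to grouping by proxy and Call-ID. The sketch below assumes a hypothetical input of (proxy, Call-ID, Max-Forwards) tuples. It does not distinguish a loop from a legitimate spiral, where the request revisits a proxy with a changed Request-URI, so a hit still needs a human look.

```python
from collections import defaultdict


def find_loops(observations) -> dict:
    """Map (proxy, call_id) -> hops per loop iteration.

    observations: iterable of (proxy, call_id, max_forwards) for INVITEs
    seen arriving at each proxy.
    """
    seen = defaultdict(set)
    for proxy, call_id, max_forwards in observations:
        # A set collapses plain retransmissions, which repeat the same value.
        seen[(proxy, call_id)].add(max_forwards)
    loops = {}
    for key, values in seen.items():
        if len(values) > 1:
            ordered = sorted(values, reverse=True)
            # Decrements between the first two arrivals = hops in the loop.
            loops[key] = ordered[0] - ordered[1]
    return loops
```

A result of 2 for both proxies in a pair means the request is bouncing directly between them; a larger number means other hops sit inside the loop and need capturing as well.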
7. The 32-second BYE: lost 2xx ACK
Symptom. The call sets up cleanly. The user hears greeting audio. Thirty-two seconds later, the call drops. Both parties hear silence and a hang-up tone.
Fingerprint. 200 OK to INVITE on the access leg. ACK from the UAC. Audio for thirty-two seconds. BYE arrives from the UAS at the 32-second mark. On the UAS's capture, the 200 OK is retransmitted at intervals of 0.5s, 1s, 2s, and then 4s repeatedly, the same T1-doubling-then-capping-at-T2 cadence the transaction layer uses elsewhere, until the UAS abandons the dialog at 32 seconds and sends BYE.
Why. This is one of the genuinely subtle parts of RFC 3261. The 2xx ACK is not generated by the transaction layer: it is generated by the UAC core (the TU) outside the INVITE transaction, per §13.2.2.4, because the INVITE Client transaction terminates immediately on receiving a 2xx (§17.1.1.2). Symmetrically on the UAS side, retransmission of a 2xx response is performed by the UAS Core, not by the INVITE Server transaction's Timer G: RFC 3261 §13.3.1.4 specifies that the UAS retransmits 2xx with intervals starting at T1 and doubling to T2, terminating at 64×T1 = 32 seconds, until ACK is received. (Timer G/H in §17.2.1 drives the equivalent retransmission for non-2xx final responses.) Either way, if the UAC's ACK never reaches the UAS, the UAS retransmits the 2xx until the 32-second cap and then abandons the dialog with BYE.
What to look for. Paired captures. If the UAC sent the ACK but the UAS never saw it, the ACK was lost between them, most often because Contact was rewritten by an intermediary on the 200 OK but Record-Route was missing, so the ACK was sent directly toward the Contact address, bypassing the intermediary that anchored the path, and was dropped before it reached the UAS. The 32-second timing and the retransmitted 200 OKs are the unambiguous fingerprint of this failure mode.
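The 2xx retransmission cadence differs from the INVITE client's Timer A in that it caps at T2. A short sketch, with assumed function names and tolerances, that generates the expected schedule and tests a UAS-side capture against it:

```python
def ok_retransmit_offsets(t1: float = 0.5, t2: float = 4.0) -> list:
    """Seconds after the first 200 OK at which the UAS core retransmits it.

    The interval starts at T1, doubles up to T2, and stops at 64*T1.
    """
    offsets, elapsed, interval = [], 0.0, t1
    give_up = 64 * t1
    while elapsed + interval < give_up:
        elapsed += interval
        offsets.append(elapsed)
        interval = min(2 * interval, t2)
    return offsets


def looks_like_lost_ack(ok_times: list, bye_time: float,
                        t1: float = 0.5, tolerance: float = 0.25) -> bool:
    """ok_times: capture times of every 200 OK copy; bye_time: the UAS's BYE."""
    if not ok_times:
        return False
    first = ok_times[0]
    observed = [t - first for t in ok_times[1:]]
    expected = ok_retransmit_offsets(t1)
    cadence_ok = len(observed) == len(expected) and all(
        abs(o - e) <= tolerance for o, e in zip(observed, expected))
    # BYE should land close to the 64*T1 give-up point (1s slack assumed).
    return cadence_ok and abs((bye_time - first) - 64 * t1) <= 1.0
```

With default timers the schedule contains ten retransmissions, the last at 31.5s. Seeing all of them on the UAS side while the UAC-side capture shows an ACK leaving is what separates a lost ACK from a UAC that never generated one.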
8. Stripped Supported: 100rel: no PRACK, lost early media
Symptom. Calls answer cleanly but the prompt or announcement at the start of the call is intermittent. Some calls play it, some do not. Customer complaint is "the IVR sometimes doesn't play".
Fingerprint. The UAC's INVITE carries
Supported: 100rel. The 183 Session Progress arrives with SDP, and it
arrives exactly once. No PRACK is sent in either direction; no
Require: 100rel or RSeq header appears anywhere. The 200
OK eventually arrives normally. Early media is lost on every call where the single
183-with-SDP packet failed to reach the UAC, because there is no PRACK round-trip
to recover it.
Why. RFC 3262 makes provisional responses reliable only when both
endpoints negotiate the 100rel extension, which means the 1xx is sent
with Require: 100rel and RSeq:, and is retransmitted by
the UAS Core until a matching PRACK with RAck: arrives.
Unreliable 1xx responses (no 100rel negotiation) are sent
once by the UAS and are not retransmitted on a timer; RFC 3261 §17.2.1
specifies timer-driven retransmission only for final responses, not for 1xx. So if
Supported: 100rel is stripped on the way to the UAS by an intermediary
that does not relay PRACK, or if the option-tag was never included in the first
place, the UAS sends 1xx-with-SDP exactly once, and a single dropped UDP packet
permanently loses the early-media SDP for that call. There is no recovery
mechanism.
What to look for. Search for Require: 100rel or
RAck: headers anywhere in the trace. If neither is present and there
is a 1xx-with-SDP, you are looking at unreliable early media. Capture upstream and
downstream of every intermediary to identify the one that strips
Supported: 100rel. SBCs configured for "minimal interop" defaults are
the usual culprit.
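The header search can be scripted over the raw SIP messages of a call. The sketch below is deliberately naive: it assumes full-form header names at the start of each line, ignores folded headers, and treats any non-100 1xx whose body starts with v=0 as early-media SDP. The function name is hypothetical.

```python
def has_unreliable_early_media(messages: list) -> bool:
    """True if the call has a 1xx with SDP but no sign of 100rel in use."""
    reliable_seen = False
    early_sdp_seen = False
    for message in messages:
        head, _, body = message.replace("\r\n", "\n").partition("\n\n")
        lines = head.split("\n")
        start_line = lines[0]
        headers = [line.lower() for line in lines[1:]]
        # RSeq/RAck or Require: 100rel anywhere means reliability was negotiated.
        if any(h.startswith(("rseq:", "rack:")) for h in headers):
            reliable_seen = True
        if any(h.startswith("require:") and "100rel" in h for h in headers):
            reliable_seen = True
        is_provisional = (start_line.startswith("SIP/2.0 1")
                          and not start_line.startswith("SIP/2.0 100"))
        if is_provisional and body.lstrip().startswith("v=0"):
            early_sdp_seen = True
    return early_sdp_seen and not reliable_seen
```

Running this on the capture from each side of an intermediary, together with a check for Supported: 100rel in the INVITE on each side, narrows down which element removed the option-tag.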
9. One-way audio
Symptom. Call sets up. Both parties' clients indicate the call is connected. One party hears the other; the other party hears silence.
Fingerprint. SIP signalling is clean: INVITE, 100, 180, 200, ACK
in order, no errors, no retransmits. RTP, captured on each leg, flows in one
direction only. The receiving side either has no inbound RTP at all or has RTP
arriving from an unexpected source IP. The SDP c= line and
m= port match on both legs' SDP.
Why. Three causes account for almost all one-way audio:
- NAT pinhole closure on the inbound side. The receiving UAC is behind NAT and the outbound RTP stream from it has not yet opened the return pinhole. If both sides wait for the other to send first, neither pinhole opens. The mitigations come from two related specs: RFC 4961 ("Symmetric RTP / RTP Control Protocol (RTCP)"), which formalises using the same IP address and port to send and receive RTP, and RFC 7362 ("Latching"), which describes how an SBC sends RTP back to the source IP/port that inbound RTP arrived from rather than to the SDP-advertised address, as a hosted-NAT-traversal mechanism. RFC 3605 governs the a=rtcp SDP attribute used to keep RTCP latching consistent with RTP. Symmetric-RTP enforcement on the SBC is the standard fix.
- SDP c=/m= advertise an address the other side cannot route to. A UAC behind NAT advertises its private address in c=. The far end faithfully sends RTP to a 10.x address it cannot reach. Without an SBC rewriting the SDP, no media flows in that direction.
- Asymmetric routing through the SBC. Signalling and media take different paths through the network, and one of those paths has a stateful firewall that allows only the established direction.
What to look for. Capture RTP on both legs of both sides. Compare
the source and destination IPs of every RTP packet against the SDP
c=/m= lines that negotiated the call. Count packets in
each direction over a ten-second window. Asymmetry of more than a few packets per
second is the unambiguous signal of one-way audio. The signalling will look
perfect; the diagnosis is always in the RTP.
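The per-direction count is straightforward once RTP packets have been reduced to source and destination addresses. In this sketch the input format, the function names, and the threshold of five packets per second (standing in for "a few") are all assumptions.

```python
from collections import Counter


def rtp_direction_counts(packets, side_a: str, side_b: str) -> Counter:
    """Count RTP packets per direction.

    packets: iterable of (src_ip, dst_ip); side_a and side_b are the media
    addresses taken from the c= lines of the two SDP bodies.
    """
    counts = Counter()
    for src, dst in packets:
        if (src, dst) == (side_a, side_b):
            counts["a_to_b"] += 1
        elif (src, dst) == (side_b, side_a):
            counts["b_to_a"] += 1
        else:
            # RTP to or from an address the SDP never advertised
            counts["unexpected"] += 1
    return counts


def is_one_way(counts: Counter, window_s: float = 10.0,
               threshold_pps: float = 5.0) -> bool:
    """True if the directions differ by more than threshold_pps on average."""
    asymmetry = abs(counts["a_to_b"] - counts["b_to_a"]) / window_s
    return asymmetry > threshold_pps
```

A non-zero "unexpected" bucket is worth as much attention as the asymmetry itself: it usually means media is arriving from a NAT-translated or otherwise unadvertised address.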
10. Session-timer expiry: the silent mid-call drop
Symptom. Long calls, typically more than fifteen minutes, drop without warning. Short calls are fine. Customer complaint is "the call just hung up after half an hour".
Fingerprint. The INVITE carried Supported: timer and
Session-Expires: 1800 (or whatever the policy is). The 200 OK
confirmed the timer with Session-Expires: 1800;refresher=uac. Mid-call,
no re-INVITE or UPDATE arrives within the timer interval. At or near the timer
expiry, one side sends BYE per RFC 4028. The trace before the BYE is silent of any
session-refreshing signalling.
Why. RFC 4028 specifies that whichever party is the
refresher must send a re-INVITE or UPDATE within the agreed
Session-Expires interval, or the session is considered to have failed
and either party may send BYE. If the refresher's policy or implementation skips
the refresh, or if the refresher's refresh is sent but never arrives at the other
side because of a transient routing problem, the receiver tears down the session
unilaterally.
What to look for. The fingerprint is a clean call followed by a
silent BYE at or shortly before the negotiated Session-Expires
interval. For an 1800-second policy, complaints will cluster around the half-hour
mark; for a 600-second policy, around the ten-minute mark. Confirm by searching the
in-dialog signalling for re-INVITEs or UPDATEs that should have refreshed the
session: there should be at least one before the timer fires, and there is not.
Pay particular attention to whether the refresher role landed where you intended;
if the SBC negotiated itself as refresher and the SBC's session-timer module is
disabled or misconfigured, it is the SBC's missing refresh that drops the call.
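To check a trace against the negotiated timer, it helps to compute when the refresh and the BYE are expected. The sketch below follows the RFC 4028 recommendations as we understand them: refresh at half the session interval, and BYE from the non-refresher at the interval minus the smaller of 32 seconds and one third of the interval. The function names are illustrative, and the missing-refresh check only covers the first interval after the 200 OK.

```python
def parse_session_expires(value: str) -> tuple:
    """Parse a Session-Expires value such as '1800;refresher=uac'."""
    parts = [part.strip() for part in value.split(";")]
    interval = int(parts[0])
    refresher = None
    for part in parts[1:]:
        name, _, val = part.partition("=")
        if name.strip().lower() == "refresher":
            refresher = val.strip().lower()
    return interval, refresher


def session_timer_deadlines(interval: int) -> tuple:
    """Return (expected refresh time, expected BYE time) in seconds."""
    refresh_at = interval / 2
    bye_at = interval - min(32, interval / 3)
    return refresh_at, bye_at


def refresh_missing(interval: int, refresh_times: list) -> bool:
    """refresh_times: seconds after the 200 OK of each re-INVITE/UPDATE seen."""
    return not any(t < interval for t in refresh_times)
```

For the 1800-second example this puts the expected refresh at 900 seconds and the BYE at 1768 seconds, which is why the drop lands slightly before the half-hour mark rather than exactly on it.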
How these compose
Real production failures rarely hit just one mode. A misconfigured SBC will
simultaneously strip Supported: 100rel (failure 8), fail to rewrite
Contact correctly on certain trunks (failure 7), and impose an aggressive codec
policy that rejects calls from devices using G.722 (failure 4). The diagnostic
discipline is the same in every case: paired captures on both sides of every
signalling intermediary, paired RTP captures on every media intermediary, and a
willingness to read the trace fingerprints rather than guess from the symptom alone.
The trace tells you. Most of the time, the trace is telling you something you can recognise from this list within thirty seconds of opening it.
Build the diagnostic instinct
SIPT-101: SIP Fundamentals covers transactions, dialogs, the offer/answer model, and the diagnostic framework that ties every one of the patterns in this post back to a precise clause in RFC 3261. The accompanying lab exercises walk through real PCAPs of each failure mode so you build the visual recognition that turns a thirty-minute diagnosis into a thirty-second one.