Every senior VoIP engineer eventually develops the same diagnostic instinct: open the trace, glance at the ladder, and know within thirty seconds which of a small handful of failure modes you are dealing with. The trace tells you. The numbers on the responses, the timing of the retransmissions, and the few headers that are present, missing, or wrong narrow the diagnosis to one of about ten patterns long before you have read a single SDP body.

This post is the field guide. Ten failure modes that account for the overwhelming majority of "it doesn't work" tickets in production SIP, the precise trace fingerprint each leaves, and the RFC clause that explains the behaviour. The list is opinionated (there are more than ten failure modes in SIP) but these are the ten that we see again and again across carriers, hosted PBX operators, and enterprise voice platforms. If you can recognise these by sight, you have eliminated about 80% of the noise in the average post-mortem.

1. DNS NAPTR/SRV resolution failure

Symptom. The user clicks call. Nothing visible happens for ten to thirty seconds. Then a SIP 503 Service Unavailable or a hard timeout. The UAC's call counter never increments on the proxy.

Fingerprint. No INVITE on the wire from the UAC at all, or an INVITE sent to the wrong address family. A packet capture wider than SIP shows a NAPTR query for the destination domain, no answer, an SRV fallback, no answer, and an A/AAAA fallback that succeeds, often pointing at a stale or wrong host. The total elapsed time before any SIP traffic appears is five to fifteen seconds, dominated by DNS resolver retries.

Why. RFC 3263 defines a strict ordering for SIP server resolution: NAPTR first to discover transports, SRV second to discover the per-transport host/port pairs, and A/AAAA only as a last resort. When the NAPTR or SRV records are missing or stale, the UAC has to time out each higher tier before falling through to the next, and each timeout is on the order of seconds.
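The resolution order above can be sketched as a query plan. This is a minimal illustration, not a resolver: the function names, the example domain, and the 5-second per-query timeout are all assumptions chosen for the example, and real resolver retry behaviour varies by platform.

```python
from typing import List, Tuple

def rfc3263_query_plan(domain: str, secure: bool = False,
                       transports: Tuple[str, ...] = ("udp", "tcp")) -> List[Tuple[str, str]]:
    """Return the (record type, query name) sequence walked for a SIP domain."""
    service = "_sips" if secure else "_sip"
    plan = [("NAPTR", domain)]
    # SRV tier: one query per transport the client supports.
    for transport in transports:
        if secure and transport == "udp":
            continue  # sips needs TLS, so the UDP query is skipped in this sketch
        plan.append(("SRV", f"{service}._{transport}.{domain}"))
    # Last resort: address records for the bare domain.
    plan.append(("A", domain))
    plan.append(("AAAA", domain))
    return plan

def worst_case_pre_sip_delay(plan: List[Tuple[str, str]],
                             per_query_timeout: float) -> float:
    """Upper bound on the silent gap if every NAPTR/SRV query times out in turn."""
    higher_tiers = [q for q in plan if q[0] in ("NAPTR", "SRV")]
    return len(higher_tiers) * per_query_timeout

plan = rfc3263_query_plan("example.com")
delay = worst_case_pre_sip_delay(plan, per_query_timeout=5.0)  # assumed timeout
```

With one NAPTR query and two SRV queries each timing out at the assumed 5 seconds, the bound lands at the top of the five-to-fifteen-second window described in the fingerprint.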

What to look for. Capture on UDP/53 alongside the SIP capture. If the SIP capture starts more than a couple of seconds after the user-initiated event and the destination is a fully-qualified domain name rather than an IP literal, NAPTR/SRV is the suspect. The fix is almost always provisioning, not protocol, but the engineer who tries to diagnose it from a SIP-only capture is going to chase ghosts for hours.

2. INVITE retransmission storm

Symptom. The call eventually fails, but the trace looks busy. Every INVITE is repeated multiple times, and so is every other request method.

Fingerprint. Identical INVITEs (same branch parameter, same Call-ID, same CSeq) at 0.5s, 1.5s, 3.5s, 7.5s, 15.5s, and 31.5s after the original, the gaps doubling each time, and the transaction abandoning at the 32-second mark. No 1xx response from the downstream element. The signature is the geometric backoff defined for the INVITE Client transaction in RFC 3261 §17.1.1.2: Timer A starts at T1 (default 500ms) and doubles on each retransmission until Timer B (64×T1 = 32s) terminates the transaction. Note that Timer A does not cap at T2: the T2 cap applies to the non-INVITE Client (Timer E) and the INVITE Server (Timer G), not to INVITE Client retransmissions.

Why. The UAC has not received a provisional response, so it is following the standard transaction-layer retransmission timer. Either the next-hop element has not received the INVITE at all (transport-level: routing, firewall, NAT pinhole) or it has received it and is failing to send 100 Trying within T1.

What to look for. Compare the inter-retransmit gaps to the canonical Timer A doubling sequence: 0.5s, 1s, 2s, 4s, 8s, 16s. If the timing matches, the retransmission is correct UAC behaviour reacting to silence from the next hop. The fault is downstream of the UAC, not in the UAC. A packet capture on the next-hop element will show whether the INVITE arrived. If it did, the next-hop element is failing to send 100 Trying. RFC 3261 §17.2.1 requires the INVITE server transaction of a stateful element to send 100 Trying unless it knows the TU will produce a response within 200ms, precisely to suppress these retransmissions, so the absence of any 1xx is itself a sign of a misbehaving proxy or a stateless one masquerading as stateful.
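The Timer A comparison is easy to automate. The sketch below, assuming UDP transport and the default T1 of 500ms, generates the expected retransmission offsets and checks a list of observed offsets against them. The function names and the 100ms tolerance are illustrative choices, not anything taken from a particular tool.

```python
from typing import List

def timer_a_schedule(t1: float = 0.5) -> List[float]:
    """Offsets (seconds after the first INVITE) of each INVITE retransmission."""
    timer_b = 64 * t1  # the transaction gives up when Timer B fires
    times, elapsed, interval = [], 0.0, t1
    while elapsed + interval < timer_b:
        elapsed += interval
        times.append(elapsed)
        interval *= 2  # Timer A doubles every time and is never capped at T2
    return times

def matches_timer_a(observed: List[float], t1: float = 0.5,
                    tolerance: float = 0.1) -> bool:
    """True if the observed retransmit offsets follow the Timer A doubling."""
    expected = timer_a_schedule(t1)
    if not observed or len(observed) > len(expected):
        return False
    return all(abs(o - e) <= tolerance for o, e in zip(observed, expected))

schedule = timer_a_schedule()  # [0.5, 1.5, 3.5, 7.5, 15.5, 31.5]
```

Feed it the offsets of the repeated INVITEs relative to the first one. A match says the UAC is behaving correctly and the problem lies at or beyond the next hop.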

3. The 401/407 authentication challenge loop

Symptom. Registration or call setup never completes. The trace shows a polite back-and-forth between client and server, never resolving.

Fingerprint. The UAC sends REGISTER (or INVITE) without credentials. The server returns 401 Unauthorized (registrar) or 407 Proxy Authentication Required (proxy) with a WWW-Authenticate or Proxy-Authenticate header carrying a fresh nonce. The UAC responds with credentials. The server returns another 401/407 with another fresh nonce. The UAC tries again. And again. The CSeq increments on every attempt; the Call-ID in REGISTER stays the same.

Why. RFC 3261 §22 and the underlying digest scheme (RFC 7616, formerly RFC 2617) compute the response hash from the username, password, realm, nonce, URI, and method, plus the cnonce and nc values when the challenge offers qop=auth. If any of those inputs is wrong on the client side, most commonly the realm or the password, every retry produces a hash the server cannot verify, and the server reissues a fresh challenge. The clue is that the second challenge's nonce differs from the first.
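To see why a wrong realm poisons every retry, it helps to compute the hash by hand. The sketch below covers only the MD5 algorithm with either no qop or qop=auth (MD5-sess, SHA-256, and auth-int are deliberately left out), and the function names are our own. The sample inputs are the worked example from RFC 2617, which uses an HTTP GET; the computation is identical for REGISTER or INVITE.

```python
import hashlib
from typing import Optional

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def digest_response(username: str, realm: str, password: str,
                    method: str, uri: str, nonce: str,
                    qop: Optional[str] = None, nc: Optional[str] = None,
                    cnonce: Optional[str] = None) -> str:
    """Digest 'response' value for algorithm=MD5, with or without qop=auth."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")  # realm is baked in here
    ha2 = md5_hex(f"{method}:{uri}")
    if qop == "auth":
        return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")
    return md5_hex(f"{ha1}:{nonce}:{ha2}")

good = digest_response("Mufasa", "testrealm@host.com", "Circle Of Life",
                       "GET", "/dir/index.html",
                       "dcd98b7102dd2f0e8b11d0f600bfb0c093",
                       qop="auth", nc="00000001", cnonce="0a4f113b")
# Same credentials computed against a different (hypothetical) realm.
bad = digest_response("Mufasa", "sbc.example.net", "Circle Of Life",
                      "GET", "/dir/index.html",
                      "dcd98b7102dd2f0e8b11d0f600bfb0c093",
                      qop="auth", nc="00000001", cnonce="0a4f113b")
```

Because the realm feeds the first hash, a UAC provisioned against the wrong realm produces a response that can never verify, no matter how many fresh nonces the server issues.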

What to look for. Look at the realm in the challenge and the realm the UAC's digest response was computed against. Mismatches happen routinely when an SBC inserts itself as the authenticating element and the UAC has been provisioned against the upstream registrar's realm. Look also at the algorithm (MD5, MD5-sess, SHA-256), qop, and opaque parameters: a UAC that mishandles qop=auth will fail every challenge silently. Two consecutive 401/407 with different nonces is the unmistakable signal that the credentials, not the connectivity, are at fault.

4. 488 Not Acceptable Here: codec mismatch

Symptom. The call is rejected immediately. The user hears a busy or "call cannot be completed" indication.

Fingerprint. The UAS responds to INVITE with 488 Not Acceptable Here. There is no 180 Ringing, no 200 OK. The UAC's outbound SDP offer carries one set of m=audio codecs (e.g. RTP/AVP 9 0 8) and the answerer rejects the entire session.

Why. RFC 3261 §21.4.26 reserves 488 for the case where the answerer has examined the offered SDP and finds no acceptable subset. RFC 3264 §6 then defines a finer-grained per-stream rejection mechanism: a stream the answerer wishes to decline is signalled by setting its m= line port to zero (e.g. m=audio 0 RTP/AVP 0) in an answer where the rest of the session is acceptable. So 488 means no part of the offer was usable; port-zero on a single m= line is the right tool for "this one stream is unusable but the rest of the session is fine".

What to look for. Read the offered SDP and compare it against what the answerer's policy permits, almost always a codec configuration mismatch. The common forensic trap is that the answerer in question is not the original UAS but an SBC enforcing a codec policy on behalf of the core. If the trace shows the SBC sourcing the 488, examine the SBC's codec filter; if the trace shows the UAS sourcing it, the offer that reached the UAS contained no codec it will accept, and the fix is in the endpoint codec lists.
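The comparison can be sketched as a set intersection over the payload types on the m=audio line. This is a deliberately simplified model: it only compares payload type numbers, so it is valid for static types such as 0 (PCMU), 8 (PCMA), and 9 (G.722), and a real check for dynamic types would have to match the a=rtpmap encodings instead. The function names and the policy sets are assumptions for illustration.

```python
from typing import List, Set

def audio_payload_types(sdp: str) -> List[str]:
    """Payload types from the first m=audio line, in offer order."""
    for line in sdp.splitlines():
        if line.startswith("m=audio "):
            # m=audio <port> <proto> <fmt> <fmt> ...
            return line.split()[3:]
    return []

def status_for_offer(offer_sdp: str, permitted: Set[str]) -> int:
    """488 if nothing in the audio offer is permitted, otherwise 200."""
    usable = [pt for pt in audio_payload_types(offer_sdp) if pt in permitted]
    return 200 if usable else 488

offer = "\r\n".join([
    "v=0",
    "c=IN IP4 192.0.2.10",
    "m=audio 49170 RTP/AVP 9 0 8",
]) + "\r\n"

g729_only_policy = {"18"}   # hypothetical SBC filter permitting only G.729
g711_policy = {"0", "8"}    # hypothetical policy permitting PCMU and PCMA
```

Running the offer against both the SBC's filter and the UAS's own codec list usually shows which of the two would have generated the 488.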

5. 423 Interval Too Brief

Symptom. A device's REGISTER fails immediately. The phone shows "registration failed" or "service unavailable". The user retries, with the same result.

Fingerprint. The UAC sends REGISTER with Expires: 60 (or some small value). The registrar returns 423 Interval Too Brief with a Min-Expires: 3600 header. The UAC, depending on its sophistication, either gives up or correctly retries with Expires: 3600 and succeeds.

Why. RFC 3261 §10.3 step 7 lets a registrar reject a REGISTER whose requested expiration is below its minimum policy threshold and return the threshold via the Min-Expires header field (defined in §20.23). A standards-compliant UAC adapts and retries; a poorly-implemented UAC reads the 4xx as a hard fail.

What to look for. A single 423 in a registration trace is almost never a fault: it is the registrar negotiating up. The fault is when the UAC does not retry with the supplied Min-Expires value. If the trace shows the UAC retrying its original Expires: 60 and getting a second 423, the UAC is broken. Devices that misbehave here are often older hardware phones with hardcoded refresh intervals or test scripts that ignore response headers.
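The expected UAC behaviour, and the check for a broken one, fit in a few lines. This is a sketch with hypothetical function names; headers are modelled as a plain dictionary rather than a parsed SIP message.

```python
from typing import Dict, List, Optional

def next_register_expires(requested: int, status: int,
                          headers: Dict[str, str]) -> Optional[int]:
    """Expires value a compliant UAC should retry with after a 423, else None."""
    if status != 423:
        return None
    min_expires = headers.get("Min-Expires")
    if min_expires is None:
        return None  # a 423 without Min-Expires gives the UAC nothing to adapt to
    return max(requested, int(min_expires))

def uac_ignores_min_expires(expires_attempts: List[int], min_expires: int) -> bool:
    """True if any attempt after the first still asks for less than Min-Expires."""
    return any(e < min_expires for e in expires_attempts[1:])

retry_with = next_register_expires(60, 423, {"Min-Expires": "3600"})
```

Pull the Expires value from each REGISTER in the trace, in order, and pass the list to `uac_ignores_min_expires`: a list like `[60, 60]` against a Min-Expires of 3600 is the broken device, `[60, 3600]` is the healthy negotiation.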

6. 483 Too Many Hops: the routing loop

Symptom. The call fails after a brief pause. The trace looks short and undramatic. The originating UAC sees a 483.

Fingerprint. Every INVITE leaving the UAC carries Max-Forwards: 70 (the default per RFC 3261 §8.1.1.6). On the way through, every proxy decrements Max-Forwards by one. If a routing loop exists, the Max-Forwards count reaches zero and a proxy returns 483 Too Many Hops per RFC 3261 §21.4.21 and the Max-Forwards check in §16.3 step 3. The trace from any single proxy in the loop will show INVITEs arriving at it repeatedly with successively lower Max-Forwards values: same Call-ID, same From-tag, decrementing Max-Forwards.

Why. Routing loops happen when two proxies have inconsistent routing tables, or when a Route header set is malformed, or when a redirect (3xx) is incorrectly handled and re-injected.

What to look for. Capture on each proxy in the suspected path. Look for the same Call-ID arriving at the same proxy more than once with different Max-Forwards values. If the same Call-ID arrives at the same proxy twice, you have your loop, and the number of decrements between the two arrivals tells you how many hops the loop spans.
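Given per-proxy captures reduced to (proxy, Call-ID, Max-Forwards) tuples, the loop check is mechanical. The sketch below is one way to do it; the tuple layout and names are assumptions, and it compares distinct Max-Forwards values so that plain retransmissions (same value seen twice) are not mistaken for a loop.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Arrival = Tuple[str, str, int]  # (proxy name, Call-ID, Max-Forwards)

def find_loops(arrivals: Iterable[Arrival]) -> Dict[Tuple[str, str], int]:
    """Map (proxy, Call-ID) to the loop length in hops, for looping calls only."""
    seen = defaultdict(set)
    for proxy, call_id, max_forwards in arrivals:
        seen[(proxy, call_id)].add(max_forwards)
    loops = {}
    for key, values in seen.items():
        if len(values) > 1:
            ordered = sorted(values, reverse=True)
            # The drop between two consecutive visits is the loop's hop count.
            loops[key] = ordered[0] - ordered[1]
    return loops

arrivals = [
    ("p1", "abc", 70), ("p2", "abc", 69),
    ("p1", "abc", 68), ("p2", "abc", 67),
    ("p1", "xyz", 70),
]
loops = find_loops(arrivals)
```

In this made-up capture the call `abc` bounces between two proxies, so both report a loop length of 2, while `xyz` is seen once and is not flagged.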

7. The 32-second BYE: lost 2xx ACK

Symptom. The call sets up cleanly. The user hears greeting audio. Thirty-two seconds later, the call drops. Both parties hear silence and a hang-up tone.

Fingerprint. 200 OK to INVITE on the access leg. ACK from the UAC. Audio for thirty-two seconds. BYE arrives from the UAS at the 32-second mark. On the UAS's capture, the 200 OK is retransmitted at intervals of 0.5s, 1s, 2s, and then 4s repeatedly (ten retransmissions in all, the last at 31.5s), the same T1-doubling-then-capping-at-T2 cadence the transaction layer uses elsewhere, and then the UAS abandons the dialog and sends BYE.

Why. This is one of the genuinely subtle parts of RFC 3261. The 2xx ACK is not generated by the transaction layer: it is generated by the UAC's TU as a fresh transaction, per §13.2.2.4 (UAC processing of 2xx responses), because the INVITE Client transaction terminates immediately on receiving 2xx (§17.1.1.2). Symmetrically on the UAS side, retransmission of a 2xx response is performed by the UAS Core, not by the INVITE Server transaction's Timer G: RFC 3261 §13.3.1.4 specifies that the UAS retransmits 2xx with intervals starting at T1 and doubling to T2, terminating at 64×T1 = 32 seconds, until ACK is received. (Timer G/H in §17.2.1 drives the equivalent retransmission for non-2xx final responses.) Either way, if the UAC's ACK never reaches the UAS, the UAS retransmits the 2xx until the 32-second cap and then abandons the dialog with BYE.
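The 2xx retransmission cadence differs from the INVITE client's in one respect: the interval is capped at T2. A small sketch, assuming the default T1 of 500ms and T2 of 4s (the function name is ours), makes the expected offsets explicit so they can be laid against the UAS-side capture.

```python
from typing import List, Tuple

def ok_retransmit_schedule(t1: float = 0.5,
                           t2: float = 4.0) -> Tuple[List[float], float]:
    """Offsets of 2xx retransmissions after the first 200 OK, plus the give-up time."""
    give_up = 64 * t1
    times, elapsed, interval = [], 0.0, t1
    while elapsed + interval < give_up:
        elapsed += interval
        times.append(elapsed)
        interval = min(interval * 2, t2)  # double, but never beyond T2
    return times, give_up

times, give_up = ok_retransmit_schedule()
```

With the defaults this yields retransmissions at 0.5s, 1.5s, 3.5s, 7.5s, and then every 4s up to 31.5s, with the dialog abandoned at 32s. A BYE from the UAS that lands right after that last retransmission is the lost-ACK signature.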

What to look for. Paired captures. If the UAC sent the ACK but the UAS never saw it, the ACK was lost between them. The most common cause is a 200 OK whose Contact carries an address the UAC cannot reach directly (an intermediary failed to rewrite it, or rewrote it incorrectly) combined with a missing Record-Route, so the UAC addressed the ACK straight to the Contact URI and bypassed the intermediary that anchored the path. The 32-second timing and the retransmitted 200 OKs are the unambiguous fingerprint of this failure mode.

8. Stripped Supported: 100rel: no PRACK, lost early media

Symptom. Calls answer cleanly but the prompt or announcement at the start of the call is intermittent. Some calls play it, some do not. Customer complaint is "the IVR sometimes doesn't play".

Fingerprint. The UAC's INVITE carries Supported: 100rel. The 183 Session Progress arrives with SDP, and it arrives exactly once. No PRACK is sent in either direction; no Require: 100rel or RSeq header appears anywhere. The 200 OK eventually arrives normally. Early media is lost on every call where the single 183-with-SDP packet failed to reach the UAC, because there is no PRACK round-trip to recover it.

Why. RFC 3262 makes provisional responses reliable only when both endpoints negotiate the 100rel extension, which means the 1xx is sent with Require: 100rel and RSeq:, and is retransmitted by the UAS Core until a matching PRACK with RAck: arrives. Unreliable 1xx responses (no 100rel negotiation) are sent once by the UAS and never retransmitted by the transaction layer; RFC 3261 §17.2.1 specifies retransmission only for final responses, not for 1xx. So if Supported: 100rel is stripped on the way to the UAS by an intermediary that does not relay PRACK, or if the option-tag was never included in the first place, the UAS sends 1xx-with-SDP exactly once, and a single dropped UDP packet permanently loses the early-media SDP for that call. There is no recovery mechanism.

What to look for. Search for Require: 100rel or RAck: headers anywhere in the trace. If neither is present and there is a 1xx-with-SDP, you are looking at unreliable early media. Capture upstream and downstream of every intermediary to identify the one that strips Supported: 100rel. SBCs configured for "minimal interop" defaults are the usual culprit.
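That search can be scripted over the raw text of the messages in a dialog. The sketch below is a rough text scan rather than a SIP parser: it ignores compact header forms and line folding, treats any non-100 1xx with an application/sdp body as early media, and uses names of our own invention.

```python
import re
from typing import List

# Any of these at the start of a header line means RFC 3262 reliability is in use.
RELIABLE_MARKERS = re.compile(r"^(Require:.*\b100rel\b|RSeq:|RAck:)",
                              re.IGNORECASE | re.MULTILINE)

def is_provisional_with_sdp(message: str) -> bool:
    """True for a 1xx response (other than 100) that carries an SDP body."""
    status = re.match(r"SIP/2\.0 (1\d\d) ", message)
    has_sdp = "application/sdp" in message.lower()
    return bool(status) and status.group(1) != "100" and has_sdp

def unreliable_early_media(messages: List[str]) -> bool:
    """True if the dialog has 1xx-with-SDP but no sign of 100rel being negotiated."""
    has_1xx_sdp = any(is_provisional_with_sdp(m) for m in messages)
    has_reliability = any(RELIABLE_MARKERS.search(m) for m in messages)
    return has_1xx_sdp and not has_reliability

invite = "INVITE sip:bob@example.com SIP/2.0\r\nSupported: 100rel\r\n\r\n"
plain_183 = ("SIP/2.0 183 Session Progress\r\n"
             "Content-Type: application/sdp\r\n\r\nv=0\r\n")
reliable_183 = ("SIP/2.0 183 Session Progress\r\n"
                "Require: 100rel\r\nRSeq: 1\r\n"
                "Content-Type: application/sdp\r\n\r\nv=0\r\n")
```

Note that `Supported: 100rel` on the INVITE alone does not count as reliability: only a `Require: 100rel`, `RSeq`, or `RAck` shows that the extension was actually used.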

9. One-way audio

Symptom. Call sets up. Both parties' clients indicate the call is connected. One party hears the other; the other party hears silence.

Fingerprint. SIP signalling is clean: INVITE, 100, 180, 200, ACK in order, no errors, no retransmits. RTP, captured on each leg, flows in one direction only. The receiving side either has no inbound RTP at all or has RTP arriving from an unexpected source IP. The SDP c= line and m= port match on both legs' SDP.

Why. Three causes account for almost all one-way audio:

  1. NAT pinhole closure on the inbound side. The receiving UAC is behind NAT and the outbound RTP stream from it has not yet opened the return pinhole. If both sides wait for the other to send first, neither pinhole opens. The mitigations come from two related specs: RFC 4961 ("Symmetric RTP / RTP Control Protocol (RTCP)"), which formalises sending and receiving RTP on the same IP address and port, and RFC 7362 ("Latching"), which describes how an SBC sends media back to the source IP/port that inbound RTP arrived from rather than to the SDP-advertised address, as a hosted-NAT-traversal mechanism. Latching only works when the endpoint behaves symmetrically. RFC 3605 defines the a=rtcp SDP attribute, which signals an RTCP port other than the default RTP port plus one. Symmetric RTP on the endpoint plus latching on the SBC is the standard fix.
  2. SDP c=/m= advertise an address the other side cannot route to. A UAC behind NAT advertises its private address in c=. The far end faithfully sends RTP to a 10.x address it cannot reach. Without an SBC rewriting the SDP, no media flows that direction.
  3. Asymmetric routing through the SBC. Signalling and media take different paths through the network, and one of those paths has a stateful firewall that allows only the established direction.

What to look for. Capture RTP on both legs of both sides. Compare the source and destination IPs of every RTP packet against the SDP c=/m= lines that negotiated the call. Count packets in each direction over a ten-second window. Asymmetry of more than a few packets per second is the unambiguous signal of one-way audio. The signalling will look perfect; the diagnosis is always in the RTP.
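The per-direction count and the source check can be sketched as follows. Packets are modelled as (source, destination) endpoint pairs already extracted from the capture, the SDP endpoints are the c=/m= address and port each side advertised, and the 5 packets-per-second threshold is an arbitrary assumption for the example. All names here are hypothetical.

```python
from typing import Dict, List, Tuple

Endpoint = Tuple[str, int]          # (IP address, UDP port)
Packet = Tuple[Endpoint, Endpoint]  # (source, destination)

def rtp_summary(packets: List[Packet], sdp_a: Endpoint,
                sdp_b: Endpoint) -> Dict[str, object]:
    """Count RTP towards each SDP-advertised endpoint and list unexpected sources."""
    to_a = sum(1 for _, dst in packets if dst == sdp_a)
    to_b = sum(1 for _, dst in packets if dst == sdp_b)
    unexpected = sorted({src for src, _ in packets if src not in (sdp_a, sdp_b)})
    return {"to_a": to_a, "to_b": to_b, "unexpected_sources": unexpected}

def is_one_way(summary: Dict[str, object], window_s: float = 10.0,
               min_pps: float = 5.0) -> bool:
    """True if one direction carries media and the other is effectively silent."""
    rate_a = summary["to_a"] / window_s
    rate_b = summary["to_b"] / window_s
    return min(rate_a, rate_b) < min_pps <= max(rate_a, rate_b)

sdp_a = ("192.0.2.10", 4000)
sdp_b = ("198.51.100.20", 5000)
a_to_b_only = [(sdp_a, sdp_b)] * 500          # 50 pps for 10s, one direction
both_ways = a_to_b_only + [(sdp_b, sdp_a)] * 500
```

A non-empty `unexpected_sources` list points towards the NAT or SDP-address causes above, since it means media is arriving from somewhere other than the address the SDP promised.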

10. Session-timer expiry: the silent mid-call drop

Symptom. Long calls, typically more than fifteen minutes, drop without warning. Short calls are fine. Customer complaint is "the call just hung up after half an hour".

Fingerprint. The INVITE carried Supported: timer and Session-Expires: 1800 (or whatever the policy is). The 200 OK confirmed the timer with Session-Expires: 1800;refresher=uac. Mid-call, no re-INVITE or UPDATE arrives within the timer interval. At or near the timer expiry, one side sends BYE per RFC 4028. The trace before the BYE is silent of any session-refreshing signalling.

Why. RFC 4028 specifies that whichever party is the refresher must send a re-INVITE or UPDATE within the agreed Session-Expires interval, or the session is considered to have failed and either party may send BYE. If the refresher's policy or implementation skips the refresh, or if the refresher's refresh is sent but never arrives at the other side because of a transient routing problem, the receiver tears down the session unilaterally.

What to look for. The fingerprint is a clean call followed by a silent BYE at or shortly before the negotiated Session-Expires interval. For an 1800-second policy, complaints will cluster around the half-hour mark; for a 600-second policy, around the ten-minute mark. Confirm by searching the in-dialog signalling for re-INVITEs or UPDATEs that should have refreshed the session: there should be at least one before the timer fires, and there is not. Pay particular attention to whether the refresher role landed where you intended; if the SBC negotiated itself as refresher and the SBC's session-timer module is disabled or misconfigured, it is the SBC's missing refresh that drops the call.
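The confirmation step can be sketched as a header parse plus a timing check. Timestamps are seconds relative to the 200 OK, the 35-second slack is an assumption meant to cover a BYE sent slightly ahead of the nominal expiry, and the function names are ours.

```python
from typing import List, Optional, Tuple

def parse_session_expires(value: str) -> Tuple[int, Optional[str]]:
    """Parse a Session-Expires value such as '1800;refresher=uac'."""
    parts = [p.strip() for p in value.split(";")]
    interval = int(parts[0])
    refresher = None
    for param in parts[1:]:
        name, _, val = param.partition("=")
        if name.strip().lower() == "refresher":
            refresher = val.strip().lower()
    return interval, refresher

def bye_matches_timer_expiry(answer_time: float, bye_time: float,
                             refresh_times: List[float], interval: int,
                             slack: float = 35.0) -> bool:
    """True if the BYE came roughly one full interval after the last refresh."""
    last_refresh = max([answer_time] + [t for t in refresh_times if t < bye_time])
    return bye_time - last_refresh >= interval - slack

interval, refresher = parse_session_expires("1800;refresher=uac")
```

A call answered at 0s with no refreshes and a BYE at 1770s matches; a call refreshed at 900s that ends at 1200s does not, and its BYE needs another explanation.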

How these compose

Real production failures rarely hit just one mode. A misconfigured SBC will simultaneously strip Supported: 100rel (failure 8), fail to rewrite Contact correctly on certain trunks (failure 7), and impose an aggressive codec policy that rejects calls from devices using G.722 (failure 4). The diagnostic discipline is the same in every case: paired captures on both sides of every signalling intermediary, paired RTP captures on every media intermediary, and a willingness to read the trace fingerprints rather than guess from the symptom alone.

The trace tells you. Most of the time, the trace is telling you something you can recognise from this list within thirty seconds of opening it.

Build the diagnostic instinct

SIPT-101: SIP Fundamentals covers transactions, dialogs, the offer/answer model, and the diagnostic framework that ties every one of the patterns in this post back to a precise clause in RFC 3261. The accompanying lab exercises walk through real PCAPs of each failure mode so you build the visual recognition that turns a thirty-minute diagnosis into a thirty-second one.
