TCP backpressure
Backpressure in a distributed system allows receiving nodes to notify sending nodes that they temporarily lack the capacity to handle further requests. Without backpressure the receiving node must attempt to handle every request and, if overloaded, must shed load by returning error responses, often causing more work as the sender retries the failed requests.
A very useful (if somewhat misunderstood) feature of TCP is its built-in support for backpressure. Every TCP segment carries a 16-bit window field in its header indicating how many more bytes the node sending that segment is prepared to receive. Typically this value is shifted left by some number of bits known as the window scale, which is negotiated with a TCP option during the initial handshake and then fixed for the lifetime of the connection, as defined in RFC 7323. A node receiving data faster than it can process it applies backpressure by reducing the window size in its TCP acknowledgements. Crucially, the node may reduce this window all the way down to zero, indicating that it can handle no more data. This backpressure mechanism is built into the operating system’s TCP implementation, and the only thing an application needs to do to access it is to stop reading data from the socket in question while it is overloaded. Once the overload has passed, the application reads from the socket again, which causes the kernel to transmit a packet advertising that the window is open again, inviting the sender to continue sending data.
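As a concrete illustration, here is a small sketch (Python over loopback, with deliberately tiny buffers; none of this setup is from the text above) showing that the application need not speak any protocol to apply backpressure: the receiver simply stops calling `recv()`, and eventually the kernel refuses to accept more data from the sender.

```python
import socket

# A receiver that stops reading causes the kernel to shrink its advertised
# window; once the in-flight buffers fill, the sender's writes are refused.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)  # small window
listener.bind(("127.0.0.1", 0))
listener.listen(1)

sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)  # small send buffer
sender.connect(listener.getsockname())
receiver, _ = listener.accept()

# The receiver never reads. Send until the kernel pushes back: that push-back
# is the backpressure mechanism surfacing to the sending application.
sender.setblocking(False)
total = 0
try:
    while True:
        total += sender.send(b"x" * 4096)
except BlockingIOError:
    pass  # send buffer full; the receive window has filled up

print("blocked after", total, "bytes")

# Reading from the socket is all the receiver must do to release the sender.
received = len(receiver.recv(65536))
print("drained", received, "bytes")
```

With blocking sockets the effect is the same, except that `send()` simply blocks instead of raising.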
Infinite patience
One of the misunderstandings about this mechanism is the belief that there must be some kind of timeout after which a connection in the zero-window backpressure state indicates an error condition and should be closed. RFC 1122 says that this is not the case:
A TCP MAY keep its offered receive window closed
indefinitely. As long as the receiving TCP continues to
send acknowledgments in response to the probe segments, the
sending TCP MUST allow the connection to stay open.
There is no need for any timeout here because the zero-window state is actively maintained by the two endpoints: the prospective sender repeatedly sends so-called zero-window probes, and the receiver’s responses confirm that both ends remain alive but that the backpressure situation persists. This is essential because when the backpressure is released the receiver sends a single window-open advertisement, and this packet may be lost. As long as the sender keeps probing, it will eventually discover that the backpressure has been released.
These repeated zero-window probes also ensure that the sender eventually detects a network partition by watching for a sufficiently long sequence of consecutive probes to which it has not received a response, as if it were sending TCP keepalives.
Probe timings
RFC 1122 has some recommendations about the exact timings of the zero-window probes:
The transmitting host SHOULD send the first zero-window
probe when a zero window has existed for the retransmission
timeout period (see Section 4.2.2.15), and SHOULD increase
exponentially the interval between successive probes.
[...] Exponential backoff is
recommended, possibly with some maximum interval not
specified here.
In practice a maximum is essential, or else the sender may wait unreasonably long before discovering that the window has reopened. For example, if it backed off by a factor of 2 on every probe with no maximum and the window-opening packet went undelivered, then, since with doubling each interval is roughly as long as all the previous intervals combined, the sender’s next probe could arrive a whole backpressure period later: it would effectively wait for twice the length of the backpressure period before discovering that the window had reopened.
But how exactly are these timings calculated?
In Linux by default the zero-window probes are scheduled similarly to regular
retransmissions, starting at
RTO_MIN (200ms) and backing off repeatedly by a factor of 2 up to a maximum
of RTO_MAX (2 minutes), which it reaches after the backpressure has lasted
for a little under 3½ minutes:
| Probe | Start/mm:ss.s | Timeout | Timeout/s | End/mm:ss.s |
|---|---|---|---|---|
| 0 | 0:00.0 | RTO_MIN | 0.2 | 0:00.2 |
| 1 | 0:00.2 | 2 × RTO_MIN | 0.4 | 0:00.6 |
| 2 | 0:00.6 | 4 × RTO_MIN | 0.8 | 0:01.4 |
| 3 | 0:01.4 | 8 × RTO_MIN | 1.6 | 0:03.0 |
| 4 | 0:03.0 | 16 × RTO_MIN | 3.2 | 0:06.2 |
| 5 | 0:06.2 | 32 × RTO_MIN | 6.4 | 0:12.6 |
| 6 | 0:12.6 | 64 × RTO_MIN | 12.8 | 0:25.4 |
| 7 | 0:25.4 | 128 × RTO_MIN | 25.6 | 0:51.0 |
| 8 | 0:51.0 | 256 × RTO_MIN | 51.2 | 1:42.2 |
| 9 | 1:42.2 | 512 × RTO_MIN | 102.4 | 3:24.6 |
| 10 | 3:24.6 | RTO_MAX | 120.0 | 5:24.6 |
| 11 | 5:24.6 | RTO_MAX | 120.0 | 7:24.6 |
| 12 | 7:24.6 | RTO_MAX | 120.0 | 9:24.6 |
| 13 | 9:24.6 | RTO_MAX | 120.0 | 11:24.6 |
| 14 | 11:24.6 | RTO_MAX | 120.0 | 13:24.6 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
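The arithmetic behind this table is easy to reproduce. Here is a sketch (not kernel code, just the schedule it produces) assuming the interval starts at RTO_MIN (0.2s), doubles on each probe, and is capped at RTO_MAX (120s):

```python
# Model of the default zero-window probe schedule: exponential backoff
# from RTO_MIN, capped at RTO_MAX.
RTO_MIN = 0.2    # seconds
RTO_MAX = 120.0  # seconds

def probe_schedule(n):
    """Return (start, timeout, end) in seconds for the first n probes."""
    rows, start = [], 0.0
    for i in range(n):
        timeout = min(RTO_MIN * 2 ** i, RTO_MAX)
        rows.append((start, timeout, start + timeout))
        start += timeout
    return rows

rows = probe_schedule(11)
# Probe 10 is the first to use RTO_MAX; it starts 204.6s (a little under
# 3.5 minutes) into the backpressure period, matching the table above.
print(rows[10])
```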
Thus the default behaviour is that the resolution of a backpressure situation
which lasted for just a few minutes might not be noticed for a further two
minutes (RTO_MAX) in the unfortunate, but not uncommon, event that the single
window-opening packet goes undelivered. Occasionally the first probes after the
window opens may also go unacknowledged, each time adding another two minutes
to any backpressure-related delays. That seems awfully long to me.
This also raises the question of how the system deals with unacknowledged probes, such as would happen if the network were partitioned or the receiving process were no longer running.
The answer is that the tcp_retries2 sysctl works similarly on zero-window
probes to how it works with regular retransmissions: if more than
tcp_retries2 consecutive zero-window probes go unacknowledged then the
connection fails. With the default value of 15, this means that a connection
across a network partition might not be considered faulty for a whopping 30
minutes (15 × RTO_MAX) after the start of the partition. Yikes!
Less is more
The best way to reduce the time it takes to detect a network partition in a
backpressure situation is to reduce tcp_retries2 to a more reasonable
value, just as in the non-backpressure
case. By setting tcp_retries2 to 5 the system will close the connection and
report the failure to the application after just 5 unacknowledged probes in a
row.
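For reference, the sysctl can be inspected and changed like this (system-wide and requires root; the values shown are the ones discussed here, and the persistent form belongs in a file under /etc/sysctl.d/):

```shell
# Show the current value (the default is 15 on most distributions)
sysctl net.ipv4.tcp_retries2

# Fail connections after 5 consecutive unacknowledged probes
# (or retransmissions)
sysctl -w net.ipv4.tcp_retries2=5
```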
If the interval between the zero-window probes were allowed to grow up to
RTO_MAX then waiting for 5 of them to go unacknowledged would still take a
pretty dreadful 10 minutes. However, the tcp_retries2 sysctl also limits the
time between the probes. I couldn’t find this mentioned in any documentation
but this is where it’s implemented in
tcp_send_probe0():
if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
icsk->icsk_backoff++;
Here icsk->icsk_backoff is the backoff counter, visible using tools such as
ss -tonie, and from which the re-probe interval is computed. The effect of
this code is to stop increasing the backoff counter, and thus the re-probe
interval, once it reaches tcp_retries2. By default this allows the re-probe
interval to increase all the way to RTO_MAX, but if tcp_retries2 is 5
then the interval between zero-window probes will not increase beyond 32 ×
RTO_MIN which is a little over 6 seconds:
| Probe | Start/mm:ss.s | Timeout | Timeout/s | End/mm:ss.s |
|---|---|---|---|---|
| 0 | 0:00.0 | RTO_MIN | 0.2 | 0:00.2 |
| 1 | 0:00.2 | 2 × RTO_MIN | 0.4 | 0:00.6 |
| 2 | 0:00.6 | 4 × RTO_MIN | 0.8 | 0:01.4 |
| 3 | 0:01.4 | 8 × RTO_MIN | 1.6 | 0:03.0 |
| 4 | 0:03.0 | 16 × RTO_MIN | 3.2 | 0:06.2 |
| 5 | 0:06.2 | 32 × RTO_MIN | 6.4 | 0:12.6 |
| 6 | 0:12.6 | 32 × RTO_MIN | 6.4 | 0:19.0 |
| 7 | 0:19.0 | 32 × RTO_MIN | 6.4 | 0:25.4 |
| 8 | 0:25.4 | 32 × RTO_MIN | 6.4 | 0:31.8 |
| 9 | 0:31.8 | 32 × RTO_MIN | 6.4 | 0:38.2 |
| 10 | 0:38.2 | 32 × RTO_MIN | 6.4 | 0:44.6 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
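The capped schedule can be modelled in the same way as the default one, with the backoff counter frozen once it reaches `tcp_retries2`, mirroring the `icsk_backoff` check quoted above (again a sketch of the arithmetic, not kernel code, assuming RTO_MIN = 0.2s):

```python
# Zero-window probe schedule with the undocumented tcp_retries2 cap:
# the interval never exceeds RTO_MIN * 2**tcp_retries2.
RTO_MIN = 0.2      # seconds
TCP_RETRIES2 = 5

def capped_schedule(n, retries2=TCP_RETRIES2):
    """Return (start, timeout, end) in seconds for the first n probes."""
    rows, start, backoff = [], 0.0, 0
    for _ in range(n):
        timeout = RTO_MIN * 2 ** backoff
        rows.append((start, timeout, start + timeout))
        start += timeout
        if backoff < retries2:  # mirrors the icsk_backoff capping
            backoff += 1
    return rows

rows = capped_schedule(11)
print(rows[10])  # probes from the 5th onwards all use 32 x RTO_MIN = 6.4s

# Worst-case partition detection: retries2 consecutive unanswered probes
print(TCP_RETRIES2 * 32 * RTO_MIN)  # 32 seconds
```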
This means that a prospective sender will be able to pick up the open window
within a few seconds even if the window-opening packet goes undelivered, losing
only a few more seconds on each undelivered probe, and a network partition will
be detected in 5 × 32 × RTO_MIN which is a little over 30 seconds, surely
vastly preferable to the 30-minute default.
User timeouts
Cloudflare has a blog post about detecting dead TCP
connections which
concludes that typical applications sending data to the internet should set
Linux’s TCP_USER_TIMEOUT socket option to be equal to the overall TCP
keepalive timeout (i.e. TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT) so that
sockets with nonempty send buffers can still detect network partitions in as
timely a fashion as ones with empty send buffers.
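Cloudflare’s recommendation translates into socket options roughly as follows (a sketch with illustrative values of my own choosing; these options are Linux-specific). Note, though, that the rest of this section argues this timeout is harmful for connections relying on zero-window backpressure.

```python
import socket

# Keepalive settings detect dead connections when the send buffer is empty;
# TCP_USER_TIMEOUT is set to the same overall budget so that a socket with
# buffered data fails on the same schedule.
KEEPIDLE, KEEPINTVL, KEEPCNT = 5, 3, 3  # seconds, seconds, probe count

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPIDLE)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPINTVL)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, KEEPCNT)

# TCP_USER_TIMEOUT takes milliseconds; match the keepalive budget of
# TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT seconds.
user_timeout_ms = (KEEPIDLE + KEEPINTVL * KEEPCNT) * 1000
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, user_timeout_ms)

print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT))  # 14000
```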
Linux’s TCP_USER_TIMEOUT socket option has the following meaning according to
man tcp:
TCP_USER_TIMEOUT (since Linux 2.6.37)
This option takes an unsigned int as an argument. When the
value is greater than 0, it specifies the maximum amount of
time in milliseconds that transmitted data may remain
unacknowledged, or buffered data may remain untransmitted
(due to zero window size) before TCP will forcibly close
the corresponding connection and return ETIMEDOUT to the
application. If the option value is specified as 0, TCP
will use the system default.
[...]
Further details on the user timeout feature can be found in
RFC 793 and RFC 5482 ("TCP User Timeout Option").
However, RFC 5482 describes a subtly different timeout:
The Transmission Control Protocol (TCP) specification [RFC0793]
defines a local, per-connection "user timeout" parameter that
specifies the maximum amount of time that transmitted data may remain
unacknowledged before TCP will forcefully close the corresponding
connection.
This RFC specifies a TCP option allowing endpoints to communicate such a
timeout to each other, but as far as I can tell Linux doesn’t make use of this
facility even if TCP_USER_TIMEOUT is set.
Confusingly RFC 793 describes such a “user timeout” with different semantics again:
The timeout, if present, permits the caller to set up a timeout
for all data submitted to TCP. If data is not successfully
delivered to the destination within the timeout period, the TCP
will abort the connection. The present global default is five
minutes.
This is the timeout specified in Linux by the SO_SNDTIMEO socket option, not
TCP_USER_TIMEOUT.
The difference between the timeout described in RFC 5482 and the implementation
of the TCP_USER_TIMEOUT option in Linux is subtle but vitally important when
considering TCP backpressure. The RFC 5482 timeout only considers
unacknowledged data, but a TCP connection in a zero-window state has no
unacknowledged data and thus this timeout should have no effect. In contrast,
the TCP_USER_TIMEOUT socket option also considers untransmitted data and
thus imposes a time limit on any backpressure situation after which the
connection is closed, violating RFC
1122:
A TCP MAY keep its offered receive window closed
indefinitely. As long as the receiving TCP continues to
send acknowledgments in response to the probe segments, the
sending TCP MUST allow the connection to stay open.
Unfortunately this makes this feature useless, indeed harmful, in a system that relies on TCP backpressure. I can imagine ways that it might be appropriate to use in Cloudflare’s particular situation, but it does not apply more generally.
Although the tcp_retries2 option does get a brief mention in Cloudflare’s
post, the author fails to explore the consequences of reducing this from its
unreasonably large default of 15 down to something more sensible. Had they
done so, they might have concluded that this is a more effective solution to
the problems they were describing than the backpressure-incompatible
TCP_USER_TIMEOUT option.