@@ -8,58 +8,39 @@ requests) and instead do some form of exponential backoff.
 We have several parameters:
  1. INITIAL_BACKOFF (how long to wait after the first failure before retrying)
  2. MULTIPLIER (factor with which to multiply backoff after a failed retry)
- 3. MAX_BACKOFF (Upper bound on backoff)
- 4. MIN_CONNECTION_TIMEOUT
+ 3. MAX_BACKOFF (upper bound on backoff)
+ 4. MIN_CONNECT_TIMEOUT (minimum time we're willing to give a connection to
+    complete)
 
 ## Proposed Backoff Algorithm
 
 Exponentially back off the start time of connection attempts up to a limit of
-MAX_BACKOFF.
+MAX_BACKOFF, with jitter.
 
 ```
 ConnectWithBackoff()
   current_backoff = INITIAL_BACKOFF
   current_deadline = now() + INITIAL_BACKOFF
-  while (TryConnect(Max(current_deadline, MIN_CONNECT_TIMEOUT))
+  while (TryConnect(Max(current_deadline, now() + MIN_CONNECT_TIMEOUT))
          != SUCCESS)
     SleepUntil(current_deadline)
     current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
-    current_deadline = now() + current_backoff
-```
-
-## Historical Algorithm in Stubby
-
-Exponentially increase up to a limit of MAX_BACKOFF the intervals between
-connection attempts. This is what stubby 2 uses, and is equivalent if
-TryConnect() fails instantly.
+    current_deadline = now() + current_backoff +
+      UniformRandom(-JITTER * current_backoff, JITTER * current_backoff)
 
 ```
-LegacyConnectWithBackoff()
-  current_backoff = INITIAL_BACKOFF
-  while (TryConnect(MIN_CONNECT_TIMEOUT) != SUCCESS)
-    SleepFor(current_backoff)
-    current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
-```
-
-The grpc C implementation currently uses this approach with an initial backoff
-of 1 second, multiplier of 2, and maximum backoff of 120 seconds. (This will
-change)
 
-Stubby, or at least rpc2, uses exactly this algorithm with an initial backoff
-of 1 second, multiplier of 1.2, and a maximum backoff of 120 seconds.
+With specific parameters of
+MIN_CONNECT_TIMEOUT = 20 seconds
+INITIAL_BACKOFF = 1 second
+MULTIPLIER = 1.6
+MAX_BACKOFF = 120 seconds
+JITTER = 0.2
 
-## Use Cases to Consider
+Implementations with pressing concerns (such as minimizing the number of wakeups
+on a mobile phone) may wish to use a different algorithm, and in particular
+different jitter logic.
 
-* Client tries to connect to a server which is down for multiple hours, eg for
-  maintenance
-* Client tries to connect to a server which is overloaded
-* User is bringing up both a client and a server at the same time
-  * In particular, we would like to avoid a large unnecessary delay if the
-    client connects to a server which is about to come up
-* Client/server are misconfigured such that connection attempts always fail
-  * We want to make sure these don’t put too much load on the server by
-    default.
-* Server is overloaded and wants to transiently make clients back off
-* Application has out of band reason to believe a server is back
-  * We should consider an out of band mechanism for the client to hint that
-    we should short circuit the backoff.
+Alternate implementations must ensure that connection backoffs started at the
+same time disperse, and must not attempt connections substantially more often
+than the above algorithm.
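
For readers who want to try the revised pseudocode, here is a minimal Python sketch of it, using the parameter values listed in the change. It is an illustration under stated assumptions, not the gRPC implementation: `try_connect` is a hypothetical callable that takes an absolute deadline and returns `True` on success, and the `now`, `sleep`, and `rng` arguments are hypothetical hooks added only so the loop can be driven by a fake clock.

```python
import random
import time

# Parameter values quoted in the change above.
MIN_CONNECT_TIMEOUT = 20.0  # seconds
INITIAL_BACKOFF = 1.0       # seconds
MULTIPLIER = 1.6
MAX_BACKOFF = 120.0         # seconds
JITTER = 0.2


def next_backoff(current_backoff):
    """Grow the backoff by MULTIPLIER, capped at MAX_BACKOFF."""
    return min(current_backoff * MULTIPLIER, MAX_BACKOFF)


def jittered(backoff, rng=random):
    """Add a uniform offset in [-JITTER * backoff, +JITTER * backoff]."""
    return backoff + rng.uniform(-JITTER * backoff, JITTER * backoff)


def connect_with_backoff(try_connect, now=time.monotonic, sleep=time.sleep,
                         rng=random):
    """Retry try_connect(deadline) until it returns True.

    try_connect is a hypothetical callable (an assumption of this sketch):
    it receives an absolute deadline on the same clock as now() and returns
    True on success. Returns the number of attempts made.
    """
    current_backoff = INITIAL_BACKOFF
    current_deadline = now() + INITIAL_BACKOFF
    attempts = 1
    # Each attempt gets at least MIN_CONNECT_TIMEOUT to complete.
    while not try_connect(max(current_deadline, now() + MIN_CONNECT_TIMEOUT)):
        # SleepUntil(current_deadline): wait out what is left of the backoff.
        remaining = current_deadline - now()
        if remaining > 0:
            sleep(remaining)
        current_backoff = next_backoff(current_backoff)
        current_deadline = now() + jittered(current_backoff, rng)
        attempts += 1
    return attempts
```

Ignoring jitter, these parameters give backoffs of 1, 1.6, 2.56, 4.096 seconds and so on, and the 120-second cap is reached on the eleventh increase. With JITTER = 0.2, each wait is spread over plus or minus 20% of the current backoff, which is what lets clients that started backing off at the same time disperse.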