diff --git a/doc/connectivity-semantics-and-api.md b/doc/connectivity-semantics-and-api.md new file mode 100644 index 00000000000..842ff8adc18 --- /dev/null +++ b/doc/connectivity-semantics-and-api.md @@ -0,0 +1,122 @@ +gRPC Connectivity Semantics and API +=================================== + +This document describes the connectivity semantics for gRPC channels and the +corresponding impact on RPCs. We then discuss an API. + +States of Connectivity +---------------------- + +gRPC Channels provide the abstraction over which clients can communicate with +servers.The client-side channel object can be constructed using little more +than a DNS name. Channels encapsulate a range of functionality including name +resolution, establishing a TCP connection (with retries and backoff) and TLS +handshakes. Channels can also handle errors on established connections and +reconnect, or in the case of HTTP/2 GO_AWAY, re-resolve the name and reconnect. + +To hide the details of all this activity from the user of the gRPC API (i.e., +application code) while exposing meaningful information about the state of a +channel, we use a state machine with four states, defined below: + +CONNECTING: The channel is trying to establish a connection and is waiting to +make progress on one of the steps involved in name resolution, TCP connection +establishment or TLS handshake. This is the initial state for all channels upon +creation. + +READY: The channel has successfully established a connection all the way +through TLS handshake (or equivalent) and all subsequent attempt to communicate +have succeeded (or are pending without any known failure ). + +TRANSIENT_FAILURE: There has been some transient failure (such as a TCP 3-way +handshake timing out or a socket error). Channels in this state will eventually +switch to the CONNECTING state and try to establish a connection again. Since +retries are done with exponential backoff, channels that fail to connect will +start out spending very little time in this state but as the attempts fail +repeatedly, the channel will spend increasingly large amounts of time in this +state. For many non-fatal failures (e.g., TCP connection attempts timing out +because the server is not yet available), the channel may be stuck in this +state for an indefinitely large amount of time. + +FATAL_FAILURE: There has been a fatal failure and the channel will never +attempt to establish a connection again. (e.g., a server presenting an invalid +TLS certificate) + +Channels that enter this state never leave this state. + +The following table lists the legal transitions from one state to another and +corresponding reasons. Empty cells denote disallowed transitions. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
From/ToCONNECTINGREADYTRANSIENT_FAILUREFATAL_FAILURE
CONNECTINGIncremental progress during connection establishmentAll steps needed to establish a connection succeededAny failure in any of the steps needed to establish connectionFatal failure encountered while attempting a connection.
READYIncremental successful communication on established channel.Any failure encountered while expecting successful communication on + established channel.
TRANSIENT_FAILUREWait time required to implement (exponential) backoff is over.
FATAL_FAILURE
+ + +Channel State API +----------------- + +All gRPC libraries will expose a channel-level API method to poll the current +state of a channel. In C++, this method is called GetCurrentState and returns +an enum for one of the four legal states. + +All libraries should also expose an API that enables the application (user of +the gRPC API) to be notified when the channel state changes. Since state +changes can be rapid and race with any such notification, the notification +should just inform the user that some state change has happened, leaving it to +the user to poll the channel for the current state. + +The synchronous version of this API is: + +```cpp +bool WaitForStateChange(gpr_timespec deadline, ChannelState source_state); +``` + +which returns true when the state changes to something other than the +source_state and false if the deadline expires. Asynchronous and futures based +APIs should have a corresponding method that allows the application to be +notified when the state of a channel changes. + +Note that a notification is delivered every time there is a transition from any +state to any *other* state. On the other hand the rules for legal state +transition, require a transition from CONNECTING to TRANSIENT_FAILURE and back +to CONNECTING for every recoverable failure, even if the corresponding +exponential backoff requires no wait before retry. The combined effect is that +the application may receive state change notifications that appear spurious. +e.g., an application waiting for state changes on a channel that is CONNECTING +may receive a state change notification but find the channel in the same +CONNECTING state on polling for current state because the channel may have +spent infinitesimally small amount of time in the TRANSIENT_FAILURE state.