tc_on_alarm() and on_writable() race, resulting in the following:
D1219 08:59:33.425951347 86323] CLIENT_CONNECT: ipv4: on_writable: error="No Error"
D1219 08:59:33.426032150 86342] CLIENT_CONNECT: ipv4: on_alarm: error="No Error"
// At this point, note that the callbacks are running on different threads.
D1219 08:59:33.426063521 86323] XXX on_writable ac->addr_str 0x603000008dd0 before unlock. # refs 2->1. Done 0
// on_writable() unrefs while still holding the lock. Because refs > 0, it marks its "done" as false and unlocks.
D1219 08:59:33.426125130 86342] XXX tc_on_alarm ac->addr_str 0x603000008dd0 before unlock. # refs 1->0. Done 1
// Right after on_writable() unlocks, tc_on_alarm() acquires the lock and unrefs, this time reaching zero and marking its "done" as true.
// It then proceeds to destroy "ac", and, in particular for this failure, "ac->addr_str".
D1219 08:59:33.426139370 86323] XXX on_writable about to read from ac->addr_str 0x603000008dd0. Done 0, error=OS Error
// When on_writable() tries to read ac->addr_str to assemble its error details, it causes a use-after-free.
The problem is that on_writable() doesn't hold the lock long enough: it unlocks before reading ac->addr_str. Alternatively, on_writable() could copy ac->addr_str while still holding the lock, but that seems more fragile. Holding the lock longer shouldn't be a performance issue, given that we are in a failure scenario.
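The locking change described above can be sketched in Python. This is only a toy model of the C code, not the actual patch: the AsyncConnect class, its fields, the message format, and the address string are all made up for illustration. The point it shows is that on_writable() reads addr_str inside the same critical section in which it drops its ref, so the other callback cannot reach zero refs and destroy the object in between.

```python
import threading


class AsyncConnect:
    """Toy stand-in for the C struct "ac": one ref per pending callback."""

    def __init__(self, addr_str):
        self.lock = threading.Lock()
        self.refs = 2  # one ref for on_writable, one for on_alarm
        self.addr_str = addr_str

    def destroy(self):
        # Models freeing ac->addr_str; reading it afterwards would be the bug.
        self.addr_str = None


def on_alarm(ac):
    with ac.lock:
        ac.refs -= 1
        done = ac.refs == 0
    if done:
        ac.destroy()


def on_writable(ac, os_error):
    with ac.lock:
        ac.refs -= 1
        done = ac.refs == 0
        # Read addr_str before unlocking. The other callback needs this lock
        # to drop the last ref, so it cannot destroy "ac" while we are here.
        details = f"{os_error}; target address: {ac.addr_str}"
    if done:
        ac.destroy()
    return details


# Hypothetical address, used only for this sketch.
ac = AsyncConnect("ipv4:127.0.0.1:50051")
results = []
writable = threading.Thread(
    target=lambda: results.append(on_writable(ac, "OS Error")))
alarm = threading.Thread(target=on_alarm, args=(ac,))
writable.start()
alarm.start()
writable.join()
alarm.join()
```

Whichever callback runs first, the error details are assembled from a live addr_str, and the object is destroyed exactly once by whichever callback drops the last ref.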
Setting the call state to "CANCELLED" after
grpc_cancel_all_calls blocked all further start batch
operations on that call. The rpc_state for the cancelled
call stayed in the server's rpc_states set and was never
removed, because the call had no active batches and the
only place we remove from rpc_states is when a batch
completes.
It is better to rely on c-core's cancellation. Once a call
is cancelled, all subsequent ops on that call will return
immediately with a cancellation error.
The RLock() change is needed because the rpc_future callback may be
invoked immediately, at the time it is created, if the call has already
completed.
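A minimal sketch of why a reentrant lock matters here, using the standard library's concurrent.futures.Future as a stand-in for the rpc_future (an assumption; the gRPC future type is not used, and the CallTracker class and its method names are hypothetical). The stdlib Future runs a done-callback immediately, on the calling thread, when the future has already completed, so a callback registered while holding a lock re-enters that lock.

```python
import threading
from concurrent.futures import Future


class CallTracker:
    """Hypothetical tracker that registers callbacks while holding a lock."""

    def __init__(self):
        # With threading.Lock() instead, track() would deadlock on a future
        # that is already done, because _on_done runs on the same thread.
        self._lock = threading.RLock()
        self.completed = []

    def _on_done(self, future):
        with self._lock:  # re-entered by the same thread in the "done" case
            self.completed.append(future.result())

    def track(self, rpc_future):
        with self._lock:
            # If rpc_future is already done, the callback fires right here.
            rpc_future.add_done_callback(self._on_done)


already_done = Future()
already_done.set_result("response")

tracker = CallTracker()
tracker.track(already_done)
```

After track() returns, tracker.completed already holds the result, which shows that the callback ran synchronously inside the locked region.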
gRPC Python required RPCs terminating with a non-OK status code to still
return a valid response value after calling set_code, even though the
response value was not supposed to be communicated to the client.
Returning None in that case is considered a programming error.
This commit introduces an alternative mechanism to terminate RPCs by
calling the `abort` method on `ServicerContext` passed to the handler,
which raises an exception and signals to the gRPC runtime to abort the
RPC with the specified status code and details.
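The difference between the two styles can be sketched with a toy model. Everything below is a stand-in written for illustration and does not import grpc: StatusCode, FakeServicerContext, AbortError, run_handler, and the ITEMS data are hypothetical, and the real runtime's internals may differ. Only the handler-side calls (set_code, and abort with a status code and details) mirror what the text describes.

```python
import enum


class StatusCode(enum.Enum):
    """Stand-in for grpc.StatusCode (only two values modelled)."""
    OK = 0
    NOT_FOUND = 5


class AbortError(Exception):
    """Hypothetical exception the fake context raises from abort()."""


class FakeServicerContext:
    def __init__(self):
        self.code = StatusCode.OK
        self.details = ""

    def set_code(self, code):
        self.code = code

    def set_details(self, details):
        self.details = details

    def abort(self, code, details):
        # Record the status, then unwind the handler by raising.
        self.code = code
        self.details = details
        raise AbortError()


ITEMS = {"a": "apple"}


def get_item_old(request, context):
    if request not in ITEMS:
        context.set_code(StatusCode.NOT_FOUND)
        context.set_details(f"no item named {request!r}")
        return ""  # dummy response, required but never sent to the client
    return ITEMS[request]


def get_item_new(request, context):
    if request not in ITEMS:
        context.abort(StatusCode.NOT_FOUND, f"no item named {request!r}")
    return ITEMS[request]


def run_handler(handler, request):
    """Toy runtime: treats AbortError as "terminate with the recorded status"."""
    context = FakeServicerContext()
    try:
        response = handler(request, context)
    except AbortError:
        response = None
    return response, context.code, context.details
```

With abort, the handler needs no placeholder return value on the error path: control never reaches the final return, and the toy runtime picks up the status code and details from the context.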