[client channel] if picker is null, fail call instead of crashing (#36167)

We've seen some recent crashes with the work_serializer_dispatch experiment where the picker is null when we try to do an LB pick.  The only code-path we've found where the picker is set to null is the one triggered by the channel_idle filter, so we think we may be hitting a race condition where the call is started on the client_channel filter after the request to go idle has hit the WorkSerializer but before it has actually run.

If that is the cause, then the right fix is to ensure that once we dispatch the request to go idle, we enqueue any subsequent RPC and dispatch a request to exit idle.  However, that will require a bit of work, and it will be easier to do as part of the new call v3 channel stack, so we don't want to invest time in fixing this in the current implementation.

For now, we add a work-around where we fail the RPC if the picker is null, which is better than crashing but not ideal.

Closes #36167

COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/36167 from markdroth:client_channel_idle_race_workaround 0207dd3bdd
PiperOrigin-RevId: 618271781
pull/36168/head
Mark D. Roth 8 months ago committed by Copybara-Service
parent c2758e87fc
commit 604a8b9155
  1. 9
      src/core/client_channel/client_channel_filter.cc

@ -2894,6 +2894,15 @@ ClientChannelFilter::LoadBalancedCall::PickSubchannel(bool was_queued) {
set_picker(chand_->picker_);
}
while (true) {
// TODO(roth): Fix race condition in channel_idle filter and any
// other possible causes of this.
if (pickers.back() == nullptr) {
if (GRPC_TRACE_FLAG_ENABLED(grpc_client_channel_lb_call_trace)) {
gpr_log(GPR_ERROR, "chand=%p lb_call=%p: picker is null, failing call",
chand_, this);
}
return absl::InternalError("picker is null -- shouldn't happen");
}
// Do pick.
if (GRPC_TRACE_FLAG_ENABLED(grpc_client_channel_lb_call_trace)) {
gpr_log(GPR_INFO, "chand=%p lb_call=%p: performing pick with picker=%p",

Loading…
Cancel
Save