From 604a8b9155aacb98d9206f2d493bac0cf7f2aa74 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 22 Mar 2024 13:27:33 -0700 Subject: [PATCH] [client channel] if picker is null, fail call instead of crashing (#36167) We've seen some recent crashes with the work_serializer_dispatch experiment where the picker is null when we try to do an LB pick. The only code-path we've found where the picker is set to null is the one triggered by the channel_idle filter, so we think we may be hitting a race condition where the call is started on the client_channel filter after the request to go idle has hit the WorkSerializer but before it has actually run. If that is the cause, then the right fix is to ensure that once we dispatch the request to go idle, we enqueue any subsequent RPC and dispatch a request to exit idle. However, that will require a bit of work, and it will be easier to do as part of the new call v3 channel stack, so we don't want to invest time in fixing this in the current implementation. For now, we add a work-around where we fail the RPC if the picker is null, which is better than crashing but not ideal. Closes #36167 COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/36167 from markdroth:client_channel_idle_race_workaround 0207dd3bdda9208774b7d2c64f088d39e3a48148 PiperOrigin-RevId: 618271781 --- src/core/client_channel/client_channel_filter.cc | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/src/core/client_channel/client_channel_filter.cc b/src/core/client_channel/client_channel_filter.cc index c3820ed8846..b6123287474 100644 --- a/src/core/client_channel/client_channel_filter.cc +++ b/src/core/client_channel/client_channel_filter.cc @@ -2894,6 +2894,15 @@ ClientChannelFilter::LoadBalancedCall::PickSubchannel(bool was_queued) { set_picker(chand_->picker_); } while (true) { + // TODO(roth): Fix race condition in channel_idle filter and any + // other possible causes of this. + if (pickers.back() == nullptr) { + if (GRPC_TRACE_FLAG_ENABLED(grpc_client_channel_lb_call_trace)) { + gpr_log(GPR_ERROR, "chand=%p lb_call=%p: picker is null, failing call", + chand_, this); + } + return absl::InternalError("picker is null -- shouldn't happen"); + } // Do pick. if (GRPC_TRACE_FLAG_ENABLED(grpc_client_channel_lb_call_trace)) { gpr_log(GPR_INFO, "chand=%p lb_call=%p: performing pick with picker=%p",