[chaotic good] fix bugs causing thready_tsan failures in handshaking (#37049)

There were two problems here:

1. A pre-existing bug in the chaotic good connector whereby it was using `std::shared_ptr<>` instead of `RefCountedPtr<>` for `HandshakeManager`, thus failing to account for the internal refs taken inside `HandshakeManager` before deciding to delete the object.
2. A change made as part of #37018 made `HandshakeManager` less tolerant to a caller that unrefs the `HandshakeManager` in the `on_handshake_done` callback, which may be run in another thread while the first thread is still holding the mutex in `HandshakeManager::DoHandshake()`.  To be defensive for this case, `DoHandshake()` now holds its own ref, which it releases after releasing the mutex.

The second problem is in principle not specific to chaotic good, but in practice it almost certainly is, because the chttp2 connector does a bunch of other work (e.g., starting another timer for the settings timeout) before unreffing the `HandshakeManager`, so it would not trigger this race condition.

Closes #37049

COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37049 from markdroth:chaotic_good_connector_thready_tsan_fix 6821dd1418
PiperOrigin-RevId: 646628702
pull/37062/head
Mark D. Roth 8 months ago committed by Copybara-Service
parent 9bad573c05
commit 6d2ba78bc0
  1. 2
      src/core/ext/transport/chaotic_good/client/chaotic_good_connector.cc
  2. 2
      src/core/ext/transport/chaotic_good/client/chaotic_good_connector.h
  3. 5
      src/core/handshaker/handshaker.cc

@ -76,7 +76,7 @@ const int32_t kTimeoutSecs = 120;
ChaoticGoodConnector::ChaoticGoodConnector(
std::shared_ptr<grpc_event_engine::experimental::EventEngine> event_engine)
: event_engine_(std::move(event_engine)),
handshake_mgr_(std::make_shared<HandshakeManager>()) {}
handshake_mgr_(MakeRefCounted<HandshakeManager>()) {}
ChaoticGoodConnector::~ChaoticGoodConnector() {
CHECK_EQ(notify_, nullptr);

@ -93,7 +93,7 @@ class ChaoticGoodConnector : public SubchannelConnector {
ActivityPtr connect_activity_ ABSL_GUARDED_BY(mu_);
const std::shared_ptr<grpc_event_engine::experimental::EventEngine>
event_engine_;
std::shared_ptr<HandshakeManager> handshake_mgr_;
RefCountedPtr<HandshakeManager> handshake_mgr_;
HPackCompressor hpack_compressor_;
HPackParser hpack_parser_;
absl::BitGen bitgen_;

@ -96,6 +96,11 @@ void HandshakeManager::DoHandshake(
Timestamp deadline, grpc_tcp_server_acceptor* acceptor,
absl::AnyInvocable<void(absl::StatusOr<HandshakerArgs*>)>
on_handshake_done) {
// We hold a ref until after the mutex is released, because we might
// wind up invoking on_handshake_done in another thread before we
// return from this function, and on_handshake_done might release the
// last ref to this object.
auto self = Ref();
MutexLock lock(&mu_);
CHECK_EQ(index_, 0u);
on_handshake_done_ = std::move(on_handshake_done);

Loading…
Cancel
Save