Builds upon #37765 to support arbitrary connection counts in the transport.
(note: at this point the number of connections is determined at connection establishment - future work will be autotuning this)
Closes#38032
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38032 from ctiller:tiefling-buffer c7520fd7a9
PiperOrigin-RevId: 698952890
The bug occurs in the following fairly specific sequence of events:
1. PF gets a resolver update with two or more addresses. It starts connecting to the first address and starts a Happy Eyeballs timer for 250ms.
- Note that the timer holds a ref to the `SubchannelList`, which is necessary to trigger the bug below. If there was only one address, there would be no Happy Eyeballs timer holding a ref here, so the bug would not occur.
2. The first subchannel reports CONNECTING and is seen by the LB policy.
3. The first subchannel reports READY, and the notification hops into the WorkSerializer but has not yet been executed.
4. The timer fires, and the timer callback hops into the WorkSerializer but has not yet been executed.
5. The LB policy gets shut down. This shuts down the `SubchannelList`, but we fail to actually shut down the underlying `SubchannelState`.
- This is the bug! We *should* be shutting down the `SubchannelState` here.
- Note that if the pending timer callback were not holding a ref to the `SubchannelList`, then the bug would not occur: the `SubchannelList` would have been immediately destroyed, which *would* have shut down the `SubchannelState`. In particular, note that if the timer had not yet fired, shutting down the `SubchannelList` would cancel the timer, thus releasing the ref immediately and shutting down the `SubchannelState`. Similarly, if the timer callback had already been seen by the LB policy, then the ref would also no longer be held.
6. The LB policy now sees the READY notification. This should be a no-op, since PF has already been shut down. However, because the `SubchannelState` was not shut down, it selects the subchannel instead.
7. The LB policy now sees the timer fire. This becomes a no-op, but it releases the ref to the `SubchannelList`, thus causing the `SubchannelList` to be destroyed. However, the `SubchannelState` for the selected subchannel from the previous step is no longer owned by the `SubchannelList`, so it is not shut down.
8. The selected subchannel now reports IDLE. This causes PF to call `GoIdle()`, and at this point we are holding the last ref to the LB policy, which we try to access after giving up that ref, thus causing a crash.
- Note that we're not actually holding this ref in order to keep the LB policy alive at this point; the ref actually exists only due to some [tech debt](14e077f9bd/src/core/load_balancing/pick_first/pick_first.cc (L196)). We should never be executing this code path to begin with after PF has been shut down, so we shouldn't need that ref.
Closes#38144
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38144 from markdroth:pick_first_new_fix 4ec9f9ea1d
PiperOrigin-RevId: 698807898
These corpora entries helped isolate a number of bugs in cancel_after_invoke.
By themselves right now I don't expect them to do much, but I want to seed our upstream fuzzers with this data so that we can find new examples in the future.
Closes#38132
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38132 from ctiller:flake-fightas-23 9f6ae727f8
PiperOrigin-RevId: 697064247
If we close reads on an mpsc then readers should also fail - not doing so can open the way for some weird stuck bugs
Closes#38138
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38138 from ctiller:flake-fightas-29 8bc61601be
PiperOrigin-RevId: 697023583
<!--
If you know who should review your pull request, please assign it to that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the appropriate
lang label.
-->
Closes#38128
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38128 from yashykt:TestMetricstestFlakiness 4ac65b2d80
PiperOrigin-RevId: 697000317
Update the chaotic-good wire format with some learnings from the past year, and set up things for the next round of changes we'd like to make:
* Instead of a composite FRAGMENT frame, split out CLIENT_INITIAL_METADATA, CLIENT_END_OF_STREAM, MESSAGE, SERVER_INITIAL_METADATA, SERVER_TRAILING_METADATA as separate frame types - this eliminates a ton of complexity in the transport, and corresponds to how we used the wire format in practice anyway.
* Switch the frame payload for metadata, settings to be protobuf instead of HPACK - this eliminates the ordering requirements on interpreting these frames between streams, which I expect to open up some flexibility with head of line avoidance in the future. It's a heck of a lot easier to read and reason about the code. It's also easier to predict the size of the frame at encode time, which lets us treat metadata and payloads more uniformly in the protocol.
* Add a connection id field to our header, in preparation for allowing multiple data connections
* Allow payloads to be shipped on the control channel ('connection id 0') and use this for sending small messages
Closes#37765
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37765 from ctiller:tiefling 7b57f72367
PiperOrigin-RevId: 695766541
Speculative attempt to fix a failing test. Hypothesis: UB on destroyed buffer when the read callbacks were executed.
Closes#38085
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38085 from drfloob:iocp-test-cleanup ae89836762
PiperOrigin-RevId: 694187408
This removes all xDS protos except for 5 of them that have services. We still have some limitations in our internal build system that make it hard to use the real xDS protos for those files, but we're now using the real xDS protos for the rest.
(Note: discovery.proto is actually a special case. While it does have services, we don't actually use those services, so that's not the reason we need a copy of this file. Unfortunately, the xDS BUILD files group discovery.proto into the same build target as ads.proto, which has services that we actually use, thus requiring us to have our own copy. This means that depending on the real discovery.proto causes us to also depend on the real ads.proto, which causes a conflict in the protobuf registry by linking two copies of ads.proto. However, we *are* using the real discovery.proto in unit tests, which do not depend on ads.proto.)
PiperOrigin-RevId: 693907782
This change upgrades the sanity test to use Clang 19, including clang-format and clang-tidy. (It's a partial implementation of the changes proposed in #38038)
Key updates:
- Docker images now utilize Clang 19.
- Code has been reformatted using the updated clang-format.
- Resolved `readability-math-missing-parentheses` warnings raised by clang-tidy.
Note that the other part of the clang-19 upgrade, "using clang-19 for C++ test" will be done once opentelemetry-cpp fixes the clang-19 build error.
Closes#38070
PiperOrigin-RevId: 693833548
We're about to completely change the wire format here... land one additional copy of the transport and tests as a hedge against bugs. Enable the hedge with an experiment.
Closes#38026
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38026 from ctiller:legacy-admission 5a32bb105d
PiperOrigin-RevId: 692984545
Improve metadata redaction comment to help people who are seeing the redaction statement in their logs.
Closes#38033
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38033 from tanvi-jagtap:improve_redaction_comment 18ba7e18c9
PiperOrigin-RevId: 692906001
[PH2][NewFile][ClassStructure][Important] Add client and server class
1. New classes Http2ServerTransport and Http2ClientTransport
2. Similar to the classes in [Chaotic Good Client Transport](https://github.com/grpc/grpc/blob/master/src/core/ext/transport/chaotic_good/client_transport.h) and [Chaotic Good Server Transport](https://github.com/grpc/grpc/blob/master/src/core/ext/transport/chaotic_good/server_transport.h)
3. Added new Test files. For now, the 2 new tests just call the constructor of Http2ServerTransport and Http2ClientTransport.
Tested locally using
```
CC=cc bazel test --test_output=all -c dbg --config=asan --verbose_failures //test/core/transport/chttp2:http2_client_transport_test
```
```
CC=cc bazel test --test_output=all -c dbg --config=asan --verbose_failures //test/core/transport/chttp2:http2_server_transport_test
```
Closes#37840
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37840 from tanvi-jagtap:ph2_add_client_server_class c6c3a0d5fb
PiperOrigin-RevId: 692824127
There was an edge case in which a socket or endpoint was shut down, a socket `read` call returned zero bytes, and there was unread in the read buffer from a previous read operation. The endpoint callbacks were called with an error status to indicate the end of the stream, and the callbacks did not consume that final chunk of data.
My current hunch is that something inside gRPC is violating the EventEngine Endpoint::Read contract, but I'm not certain what, yet. 88b5c9e3ab/include/grpc/event_engine/event_engine.h (L197-L199)
However, by modifying WindowsEndpoint to return an `absl::OkStatus()` if there's any data in the buffer, tests appear to pass.
Closes#38014
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38014 from drfloob:win-endpoint-data-leak b24b2d9f8a
PiperOrigin-RevId: 691063044
In EventEngineClientChannelDNSResolver, it needs to acquire `on_resolved_mu_` lock when calling dns resolver APIs as well as in on_resolve callback.
In DNSServiceResolverImpl::LookupHostname, it acquires `request_mu_` lock, so the lock order is:
EventEngineClientChannelDNSResolver::on_resolved_mu_ -> DNSServiceResolverImpl::request_mu_
Upon the resolution successful or failed, DNSServiceResolverImpl calls the on_resolved callback without holding any locks, so only one lock here:
EventEngineClientChannelDNSResolver::on_resolved_mu_
However when DNSServiceResolver was deleted, in DNSServiceResolverImpl::Shutdown, it calls the on_resolve callbacks while holding the `request_mu_` lock, as a result the lock order becomes:
DNSServiceResolverImpl::request_mu_ -> EventEngineClientChannelDNSResolver::on_resolved_mu_
which triggers the deadlock check in absl::Mutex as these two locks are acquired in different orders.
This PR release the on_resolved_mu_ first before calling on_resolve, so abls won't complain.
Added a new test which fails with GPR_ABSEIL_SYNC=1 (not enabled by default) without this PR.
Closes#38010
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38010 from HannahShiSFB:fix-lock-invert-in-dns-service-resolver-shutdown 5adc8f32b3
PiperOrigin-RevId: 690781049
`//test/core/end2end/...` tests stopped running back in March 2024, when the linking process for test binaries changed. See https://github.com/grpc/grpc/pull/36197.
The test targets ran on Windows, but zero tests were found inside those targets, so the tests succeeded instantly. This fix results in longer linking steps, and more disk space consumed, but tests are getting discovered now. To illustrate the problem, run `bazel test //test/core/end2end:ping_pong_streaming_test --test_output=all --test_arg=--gtest_list_tests`. Before this PR, zero tests were found. Now:
```
CoreEnd2endTest.
PingPongStreaming1/Inproc
PingPongStreaming1/Chttp2FakeSecurityFullstack
PingPongStreaming1/Chttp2Fullstack
PingPongStreaming1/Chttp2FullstackCompression
PingPongStreaming1/Chttp2FullstackLocalIpv4
...
```
Closes#37918
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37918 from drfloob:fix-win-core-e2e e60f832a5d
PiperOrigin-RevId: 690741492
This will disable the jobs in both bazel and cmake builds, which is necessary for our CI to remain happy. These end2end tests were enabled on Windows at some point without any notice (they had been intentionally disabled for a while), but the RBE jobs have been silently failing for 7 months, so it's unclear when that happened.
These failures still need to be examined so we can re-enable these tests.
Closes#37983
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37983 from drfloob:winguh 64a62fd5b9
PiperOrigin-RevId: 689118231
<!--
If you know who should review your pull request, please assign it to that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the appropriate
lang label.
-->
Closes#37973
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37973 from yijiem:core-end2end-windows-hack a2da4ae4eb
PiperOrigin-RevId: 688727248
Porting from #37829.
This ensures that we wait to create the stream to the handshaker service until handshake frames arrive from the client. Without this change, a TCP connection to the ALTS server triggers the stream to the handshaker service to be created, even if no handshake frames have arrived from the client. This waste resources and can potentially trigger the ALTS server to freeze up, because there is a cap on the number of concurrent ALTS handshakes that a server can perform.
Closes#37961
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37961 from matthewstevenson88:alts-fix f8f07e59bb
PiperOrigin-RevId: 687977457
Don't complete writes of messages until they make it to the transports outbound loop. Since payloads could be large this introduces just enough pushback that, once #37868 goes in also we should be able to sense when a transport is busy writing and stop sending at higher layers.
Closes#37894
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37894 from ctiller:send-acked 0cb3d7f8ad
PiperOrigin-RevId: 686689473
This is missing in v3 vs v2
- in v2 we had Pipe setup so that multiple Pipe stages could be chained and only complete when the last stage had passed flow control, whereas in v3 the top stage will start accepting requests as soon as the first stage in the pipeline takes the message.
Closes#37868
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37868 from ctiller:drizzling 69209da8a7
PiperOrigin-RevId: 686652402
To prepare for the upcoming upgrade to C++17, the following changes were made:
Increased minimum supported operating system versions:
- iOS: 11 (previously 10)
- macOS: 10.14 (previously 10.12)
- tvOS: 13.0 (previously 12.0)
In addition to this, version requirements across different projects were updated to use these for consistency.
Closes#37931
PiperOrigin-RevId: 686519641
This migrates all of the xDS unit tests except for the fuzzer, which I'll get in a subsequent PR.
This also does not include the xDS e2e tests, which I will also do separately.
Closes#37896
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37896 from markdroth:xds_tests_use_real_protos2 de568b4e53
PiperOrigin-RevId: 686197812
This eliminates the need for the `grpc_cc_proto_library` bazel BUILD rule introduced in #37863.
To make this work, I had to upgrade several bazel dependencies and apply a patch to rules_go to work around https://github.com/bazelbuild/bazel/issues/11636.
Closes#37902
PiperOrigin-RevId: 685868647
We'll probably disable some next week :)
But I want to watch a good selection and refine criteria for acceptance.
Closes#37903
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37903 from ctiller:ALL-the-things 5f829db870
PiperOrigin-RevId: 685010911
So far missing for HTTP/2 style flow control has been a primitive to query whether there's a receiver for flow control data at the other end of the message pipes.
Here I'm updating the state machine accessors to accommodate that functionality.
No new states were needed.
Whilst here, document the current member functions on `CallState`.
Closes#37867
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37867 from ctiller:like-the-river c9814c737d
PiperOrigin-RevId: 684972125
This is a trial baloon to see if we can actually make this work. If it does, I'll change the remaining xDS tests to use the real xDS protos and completely remove our local copies.
Closes#37863
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37863 from markdroth:xds_tests_use_real_protos 3ad2fe12be
PiperOrigin-RevId: 684877750
I'm continuing to look into some flakes here, but in the meantime these shouldn't halt submissions. Marking them flaky.
Closes#37880
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37880 from ctiller:mark-flaky 27427c7978
PiperOrigin-RevId: 684526341