Builds upon #37765 to support arbitrary connection counts in the transport.
(Note: at this point the number of connections is fixed at connection establishment; autotuning it is future work.)
Closes #38032
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38032 from ctiller:tiefling-buffer c7520fd7a9
PiperOrigin-RevId: 698952890
The bug occurs in the following fairly specific sequence of events:
1. PF gets a resolver update with two or more addresses. It starts connecting to the first address and starts a Happy Eyeballs timer for 250ms.
- Note that the timer holds a ref to the `SubchannelList`, which is necessary to trigger the bug below. If there was only one address, there would be no Happy Eyeballs timer holding a ref here, so the bug would not occur.
2. The first subchannel reports CONNECTING and is seen by the LB policy.
3. The first subchannel reports READY, and the notification hops into the WorkSerializer but has not yet been executed.
4. The timer fires, and the timer callback hops into the WorkSerializer but has not yet been executed.
5. The LB policy gets shut down. This shuts down the `SubchannelList`, but we fail to actually shut down the underlying `SubchannelState`.
- This is the bug! We *should* be shutting down the `SubchannelState` here.
- Note that if the pending timer callback were not holding a ref to the `SubchannelList`, then the bug would not occur: the `SubchannelList` would have been immediately destroyed, which *would* have shut down the `SubchannelState`. In particular, note that if the timer had not yet fired, shutting down the `SubchannelList` would cancel the timer, thus releasing the ref immediately and shutting down the `SubchannelState`. Similarly, if the timer callback had already been seen by the LB policy, then the ref would also no longer be held.
6. The LB policy now sees the READY notification. This should be a no-op, since PF has already been shut down. However, because the `SubchannelState` was not shut down, it selects the subchannel instead.
7. The LB policy now sees the timer fire. This becomes a no-op, but it releases the ref to the `SubchannelList`, thus causing the `SubchannelList` to be destroyed. However, the `SubchannelState` for the selected subchannel from the previous step is no longer owned by the `SubchannelList`, so it is not shut down.
8. The selected subchannel now reports IDLE. This causes PF to call `GoIdle()`, and at this point we are holding the last ref to the LB policy, which we try to access after giving up that ref, thus causing a crash.
- Note that we're not actually holding this ref in order to keep the LB policy alive at this point; the ref actually exists only due to some [tech debt](14e077f9bd/src/core/load_balancing/pick_first/pick_first.cc (L196)). We should never be executing this code path to begin with after PF has been shut down, so we shouldn't need that ref.
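The sequence above can be modeled with a minimal sketch. This is not the real pick_first code: the Python classes below are hypothetical stand-ins that borrow the names `SubchannelList` and `SubchannelState`, and a plain deque stands in for the WorkSerializer. It only illustrates the point from step 5, that shutting down the list should shut down each state explicitly rather than waiting for the list to be destroyed.

```python
from collections import deque


class SubchannelState:
    """Hypothetical stand-in for PF's per-subchannel state."""

    def __init__(self, policy):
        self.policy = policy
        self.shut_down = False

    def shutdown(self):
        self.shut_down = True

    def on_connectivity_state(self, state):
        if self.shut_down:
            return  # Queued notification arriving after shutdown: no-op.
        if state == "READY":
            self.policy.selected = self


class SubchannelList:
    def __init__(self, policy, num_addresses):
        self.subchannels = [SubchannelState(policy) for _ in range(num_addresses)]

    def shutdown(self):
        # Shut each state down explicitly instead of relying on the list's
        # destruction, which a pending timer callback can delay.
        for subchannel in self.subchannels:
            subchannel.shutdown()


class PickFirst:
    def __init__(self, num_addresses):
        self.selected = None
        self.subchannel_list = SubchannelList(self, num_addresses)

    def shutdown(self):
        self.subchannel_list.shutdown()
        self.subchannel_list = None


def run_scenario():
    work_serializer = deque()  # Callbacks that have "hopped" but not yet run.
    policy = PickFirst(num_addresses=2)
    subchannel_list = policy.subchannel_list
    first = subchannel_list.subchannels[0]

    # Step 3: READY notification is queued but not yet executed.
    work_serializer.append(lambda: first.on_connectivity_state("READY"))

    # Step 4: the timer callback is queued; it holds a ref to the list.
    def on_timer(held_ref=subchannel_list):
        return None

    work_serializer.append(on_timer)

    # Step 5: the policy is shut down while both callbacks are pending.
    policy.shutdown()

    # Steps 6 and 7: the queued callbacks finally run.
    while work_serializer:
        work_serializer.popleft()()
    return policy.selected
```

With the explicit per-state shutdown in place, `run_scenario()` leaves `policy.selected` unset, so the later IDLE report in step 8 has no selected subchannel to act on.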
Closes #38144
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38144 from markdroth:pick_first_new_fix 4ec9f9ea1d
PiperOrigin-RevId: 698807898
By itself this is a no-op, but a future change will leverage this to allow fuzzers to inject thread hops into party activations (a technique that has helped find multiple long-lived bugs in the past 24 hours).
Closes #38139
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38139 from ctiller:flake-fightas-30 e19a1af694
PiperOrigin-RevId: 697620027
VLOG is probably the wrong thing here (considering it's been requested explicitly via a trace)
Closes #38135
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38135 from ctiller:flake-fightas-26 52a78995d2
PiperOrigin-RevId: 697067177
If we close reads on an mpsc, then readers should also fail. Not doing so can open the way for some weird stuck bugs.
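As a rough illustration of that behavior, here is a minimal synchronous sketch. The real mpsc is promise-based C++; the class and method names below (`Mpsc`, `close_reads`, `receive`, `ClosedError`) are hypothetical and chosen only for this example.

```python
from collections import deque


class ClosedError(Exception):
    """Raised when receiving from a queue whose read side is closed."""


class Mpsc:
    """Hypothetical, synchronous stand-in for a multi-producer queue."""

    def __init__(self):
        self._items = deque()
        self._reads_closed = False

    def send(self, item):
        # Senders learn about the closure through the return value.
        if self._reads_closed:
            return False
        self._items.append(item)
        return True

    def close_reads(self):
        self._reads_closed = True
        self._items.clear()

    def receive(self):
        # Fail readers too, so nobody waits forever on a closed queue.
        if self._reads_closed:
            raise ClosedError("reads closed")
        if not self._items:
            return None  # A real implementation would suspend here.
        return self._items.popleft()
```

The design choice is that a reader calling `receive()` after `close_reads()` gets an immediate error instead of waiting for an item that can never arrive.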
Closes #38138
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38138 from ctiller:flake-fightas-29 8bc61601be
PiperOrigin-RevId: 697023583
Fix https://github.com/grpc/grpc/issues/37969.
There is an inverted length check in GrpcPolledFdWindows before memcpying from gRPC's `recv_from_source_addr_` into c-ares' socket address structure. Newer c-ares versions use `struct sockaddr_storage`, which is 128 bytes, for the socket address, and that change exposed the bug.
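The effect of an inverted check can be shown with a small sketch. This is not the GrpcPolledFdWindows code: the two functions are hypothetical, byte buffers stand in for the socket address structures, and the exact form of the original comparison is an assumption. The 16-byte source models an IPv4 `sockaddr_in`.

```python
SOCKADDR_IN_SIZE = 16        # Size of an IPv4 sockaddr_in.
SOCKADDR_STORAGE_SIZE = 128  # Size of sockaddr_storage, per the text above.


def copy_addr_inverted(src, dst):
    """Hypothetical buggy version: the comparison points the wrong way."""
    if len(dst) > len(src):  # Rejects the safe case of a roomy destination.
        raise ValueError("length check failed")
    dst[: len(src)] = src
    return len(src)


def copy_addr_fixed(src, dst):
    """Hypothetical fixed version: only reject a destination that is too small."""
    if len(src) > len(dst):
        raise ValueError("destination buffer too small")
    dst[: len(src)] = src
    return len(src)
```

With a 16-byte source and a 128-byte destination, `copy_addr_fixed` copies the address, while `copy_addr_inverted` raises even though the copy would have been safe.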
Closes #38101
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38101 from yijiem:37969 282fc8269e
PiperOrigin-RevId: 696607100
Just used this to find out that we always do a TCP write for client initial metadata prior to the payload.
Closes #38053
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38053 from ctiller:party-see 6b5a2ba6cf
PiperOrigin-RevId: 696371772
Update the chaotic-good wire format with some learnings from the past year, and set up things for the next round of changes we'd like to make:
* Instead of a composite FRAGMENT frame, split out CLIENT_INITIAL_METADATA, CLIENT_END_OF_STREAM, MESSAGE, SERVER_INITIAL_METADATA, SERVER_TRAILING_METADATA as separate frame types - this eliminates a ton of complexity in the transport, and corresponds to how we used the wire format in practice anyway.
* Switch the frame payload for metadata, settings to be protobuf instead of HPACK - this eliminates the ordering requirements on interpreting these frames between streams, which I expect to open up some flexibility with head of line avoidance in the future. It's a heck of a lot easier to read and reason about the code. It's also easier to predict the size of the frame at encode time, which lets us treat metadata and payloads more uniformly in the protocol.
* Add a connection id field to our header, in preparation for allowing multiple data connections
* Allow payloads to be shipped on the control channel ('connection id 0') and use this for sending small messages
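To make the last two bullets concrete, here is a sketch of a frame header carrying a connection id, plus a routing rule that sends small messages over the control channel. Only the frame type names and the meaning of connection id 0 come from the description above. The numeric values, field widths, field order, byte order, the 64-byte threshold, and the stream-to-connection mapping are all invented for illustration and are not the actual chaotic-good layout.

```python
import struct
from enum import IntEnum


class FrameType(IntEnum):
    # Names come from the description above; numeric values are made up.
    CLIENT_INITIAL_METADATA = 1
    CLIENT_END_OF_STREAM = 2
    MESSAGE = 3
    SERVER_INITIAL_METADATA = 4
    SERVER_TRAILING_METADATA = 5


# Hypothetical layout: type (1 byte), connection id (2), stream id (4),
# payload length (4), little-endian with no padding.
HEADER = struct.Struct("<BHII")
CONTROL_CONNECTION_ID = 0
SMALL_MESSAGE_BYTES = 64  # Hypothetical cutoff for "small" payloads.


def pick_connection_id(payload_len, stream_id, num_data_connections):
    """Route small payloads to the control channel, others to a data connection."""
    if payload_len <= SMALL_MESSAGE_BYTES or num_data_connections == 0:
        return CONTROL_CONNECTION_ID
    return 1 + stream_id % num_data_connections


def encode_header(frame_type, connection_id, stream_id, payload_len):
    return HEADER.pack(frame_type, connection_id, stream_id, payload_len)


def decode_header(data):
    frame_type, connection_id, stream_id, payload_len = HEADER.unpack(data)
    return FrameType(frame_type), connection_id, stream_id, payload_len
```

Because the payload length is part of the header and the metadata encoding has a predictable size, the sender can choose the connection before writing any bytes, which is one way metadata and message payloads could be treated uniformly.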
Closes #37765
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37765 from ctiller:tiefling 7b57f72367
PiperOrigin-RevId: 695766541
- Adding two experiments for the promise-based HTTP2 transport.
- We have kept the client and server transport experiments separate to help with smoother rollouts and with interop testing.
- The experiments are disabled; we expect this project to take several months.
Closes #38103
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38103 from tanvi-jagtap:client_server_transport_experiment 53a24bda04
PiperOrigin-RevId: 695606023
This fixes b/323916594. In some flaky cases, a skipped test seems to cause an unnecessary segfault at test shutdown. This test is no longer relevant in newer versions of PHP, and it had already been skipped for a while.
Closes #38090
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38090 from ajinkyakulkarni75:skipped-test 2d159b4ffb
PiperOrigin-RevId: 694567089
Fix https://github.com/grpc/grpc/issues/37742.
Closes #38069
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38069 from yijiem:37742 064c437b8d
PiperOrigin-RevId: 694305299
This removes all xDS protos except for 5 of them that have services. We still have some limitations in our internal build system that make it hard to use the real xDS protos for those files, but we're now using the real xDS protos for the rest.
(Note: discovery.proto is actually a special case. While it does have services, we don't actually use those services, so that's not the reason we need a copy of this file. Unfortunately, the xDS BUILD files group discovery.proto into the same build target as ads.proto, which has services that we actually use, thus requiring us to have our own copy. This means that depending on the real discovery.proto causes us to also depend on the real ads.proto, which causes a conflict in the protobuf registry by linking two copies of ads.proto. However, we *are* using the real discovery.proto in unit tests, which do not depend on ads.proto.)
PiperOrigin-RevId: 693907782
The target `:default_event_engine_factory` currently uses gRPC-specific config_settings which bundle the CPU with the OS (e.g. `cpu: windows_x86_64`). Use Bazel's OS constraint as one of the select cases so the correct target is used when setting `os: windows` as a constraint.
PiperOrigin-RevId: 693890511
We've got a customer that's seeing some failures right now, and we are stuck debugging because we don't have sufficient log visibility.
Closes #38065
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38065 from ctiller:loggy 8df1d8a4bb
PiperOrigin-RevId: 693879687
ResolvedAddrToUnixPathIfPossible is only called when GRPC_HAVE_UNIX_SOCKET is defined, so there's no need to define that function when it isn't.
Closes #38016
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38016 from hferreiro:master 3d00260104
PiperOrigin-RevId: 693833757
This change upgrades the sanity test to use Clang 19, including clang-format and clang-tidy. (It's a partial implementation of the changes proposed in #38038)
Key updates:
- Docker images now utilize Clang 19.
- Code has been reformatted using the updated clang-format.
- Resolved `readability-math-missing-parentheses` warnings raised by clang-tidy.
Note that the other part of the Clang 19 upgrade, using Clang 19 for the C++ tests, will be done once opentelemetry-cpp fixes the Clang 19 build error.
Closes #38070
PiperOrigin-RevId: 693833548
`//src/python/grpcio_tests/tests/unit:_contextvars_propagation_test` is very flaky, mainly in two ways:
1. Failing with error `Error in bind for address '/tmp/grpc_fullstack_test.sock': Address already in use`.
2. Failing with timeout without any error.
#### Address already in use error
This is because we're reusing the same path for all test cases: 5011420f16/src/python/grpcio_tests/tests/unit/_contextvars_propagation_test.py (L31)
#### Timeout error
We delete the tmp file after the test is done:
5011420f16/src/python/grpcio_tests/tests/unit/_contextvars_propagation_test.py (L64-L66)
This might cause Core to fail to connect to the channel with the error `connect failed: addr: unix:/tmp/grpc_fullstack_test.sock error: No such file or directory`. Core will keep retrying, which causes the test to time out.
To make things worse, we're using multiple threads in one of the test cases, leading to an even higher rate of flakiness.
This PR fixes the issue by using a different address for each test run.
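One way to get a distinct address per run is sketched below. It is not necessarily what the PR does: the helper name `unique_uds_address` and the file name `test.sock` are hypothetical, and the approach assumes a fresh temporary directory per call is acceptable for the test.

```python
import os
import tempfile


def unique_uds_address():
    """Return a ("unix:<path>", path) pair that no other call will reuse."""
    # mkdtemp creates a new, uniquely named directory on every call, so the
    # socket path inside it cannot collide with another test case or run.
    tmp_dir = tempfile.mkdtemp(prefix="grpc_fullstack_test_")
    path = os.path.join(tmp_dir, "test.sock")
    return f"unix:{path}", path
```

Each test case would call the helper once and bind its server to the returned address, so removing one test's socket file cannot affect a channel that belongs to another test.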
Closes #38076
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38076 from XuanWang-Amos:fix_contextvar_test 93ab2b350f
PiperOrigin-RevId: 693812629
This log can be hit under normal circumstances (e.g. a client has an expired cert and authenticates to the server), so this should be an INFO-level log rather than an ERROR-level log.
Closes #38058
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38058 from matthewstevenson88:downgrade23 1cbdd5a3e7
PiperOrigin-RevId: 693375018
Closes #38056
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38056 from yijiem:ee-dns-non-client-channel-chaotic 74c3b20731
PiperOrigin-RevId: 693112346
We're about to completely change the wire format here... land one additional copy of the transport and tests as a hedge against bugs. Enable the hedge with an experiment.
Closes #38026
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38026 from ctiller:legacy-admission 5a32bb105d
PiperOrigin-RevId: 692984545
Improve metadata redaction comment to help people who are seeing the redaction statement in their logs.
Closes #38033
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38033 from tanvi-jagtap:improve_redaction_comment 18ba7e18c9
PiperOrigin-RevId: 692906001
[PH2][NewFile][ClassStructure][Important] Add client and server class
1. New classes Http2ServerTransport and Http2ClientTransport
2. Similar to the classes in [Chaotic Good Client Transport](https://github.com/grpc/grpc/blob/master/src/core/ext/transport/chaotic_good/client_transport.h) and [Chaotic Good Server Transport](https://github.com/grpc/grpc/blob/master/src/core/ext/transport/chaotic_good/server_transport.h)
3. Added new test files. For now, the 2 new tests just call the constructors of Http2ServerTransport and Http2ClientTransport.
Tested locally using:
```
CC=cc bazel test --test_output=all -c dbg --config=asan --verbose_failures //test/core/transport/chttp2:http2_client_transport_test
```
```
CC=cc bazel test --test_output=all -c dbg --config=asan --verbose_failures //test/core/transport/chttp2:http2_server_transport_test
```
Closes #37840
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37840 from tanvi-jagtap:ph2_add_client_server_class c6c3a0d5fb
PiperOrigin-RevId: 692824127