Fix for b/365993761.
Noticed that XdsClient metrics were not being reported because the authority was not being set properly.
This solution is not perfect, since channels created later can use a different authority; for now we prefer the default authority from the first channel.
Closes #38009
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38009 from yashykt:AddAuthorityToXdsClientMetricsScope 00071efa23
PiperOrigin-RevId: 691149703
There was an edge case in which a socket or endpoint was shut down, a socket `read` call returned zero bytes, and there was unread data in the read buffer from a previous read operation. The endpoint callbacks were called with an error status to indicate the end of the stream, and the callbacks did not consume that final chunk of data.
My current hunch is that something inside gRPC is violating the EventEngine Endpoint::Read contract, but I'm not certain what, yet. 88b5c9e3ab/include/grpc/event_engine/event_engine.h (L197-L199)
However, by modifying WindowsEndpoint to return an `absl::OkStatus()` if there's any data in the buffer, tests appear to pass.
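The decision described above can be sketched roughly as follows. This is a minimal illustration, not the actual WindowsEndpoint code: `ReadStatus` and `StatusForCompletedRead` are hypothetical names, and the enum stands in for `absl::Status`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical status type standing in for absl::Status in this sketch.
enum class ReadStatus { kOk, kEndOfStream };

// Decides what to report to the read callback when a socket read completes.
// `bytes_from_this_read` is what the latest read returned (0 on shutdown or
// end of stream); `buffered` holds data left over from an earlier read.
ReadStatus StatusForCompletedRead(std::size_t bytes_from_this_read,
                                  const std::string& buffered) {
  if (bytes_from_this_read > 0 || !buffered.empty()) {
    // There is still data to deliver, so report success and let the caller
    // consume it. The end of stream can be reported on a later read.
    return ReadStatus::kOk;
  }
  return ReadStatus::kEndOfStream;
}
```

With this shape, a zero-byte read that arrives while data is still buffered reports OK, so the final chunk is not dropped.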
Closes #38014
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38014 from drfloob:win-endpoint-data-leak b24b2d9f8a
PiperOrigin-RevId: 691063044
EventEngineClientChannelDNSResolver needs to hold the `on_resolved_mu_` lock when calling DNS resolver APIs as well as in the on_resolve callback.
DNSServiceResolverImpl::LookupHostname acquires the `request_mu_` lock, so the lock order is:
EventEngineClientChannelDNSResolver::on_resolved_mu_ -> DNSServiceResolverImpl::request_mu_
When resolution succeeds or fails, DNSServiceResolverImpl calls the on_resolve callback without holding any locks, so only one lock is involved:
EventEngineClientChannelDNSResolver::on_resolved_mu_
However, when DNSServiceResolver is deleted, DNSServiceResolverImpl::Shutdown calls the on_resolve callbacks while holding the `request_mu_` lock. As a result, the lock order becomes:
DNSServiceResolverImpl::request_mu_ -> EventEngineClientChannelDNSResolver::on_resolved_mu_
which triggers the deadlock check in absl::Mutex, because the two locks are acquired in different orders.
This PR releases the on_resolved_mu_ first before calling on_resolve, so absl won't complain.
Added a new test which fails with GPR_ABSEIL_SYNC=1 (not enabled by default) without this PR.
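One common way to avoid this kind of inversion is to move the pending callbacks out while holding the lock and invoke them only after the lock is released. The sketch below shows that general pattern with `std::mutex`; `PendingRequests` is a hypothetical class and this is not necessarily the exact change made in this PR.

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a resolver that tracks pending callbacks.
class PendingRequests {
 public:
  void Add(std::function<void()> on_resolve) {
    std::lock_guard<std::mutex> lock(mu_);
    callbacks_.push_back(std::move(on_resolve));
  }

  // Moves the callbacks out while holding the lock, then runs them after the
  // lock is released. A callback may then take other locks, or call back into
  // this object, without introducing a second lock order.
  void Shutdown() {
    std::vector<std::function<void()>> to_run;
    {
      std::lock_guard<std::mutex> lock(mu_);
      to_run.swap(callbacks_);
    }
    for (auto& callback : to_run) callback();
  }

 private:
  std::mutex mu_;
  std::vector<std::function<void()>> callbacks_;
};
```

Because no lock is held while the callbacks run, a callback that re-enters `Add` does not deadlock.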
Closes #38010
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38010 from HannahShiSFB:fix-lock-invert-in-dns-service-resolver-shutdown 5adc8f32b3
PiperOrigin-RevId: 690781049
`//test/core/end2end/...` tests stopped running back in March 2024, when the linking process for test binaries changed. See https://github.com/grpc/grpc/pull/36197.
The test targets ran on Windows, but zero tests were found inside those targets, so the tests succeeded instantly. This fix results in longer linking steps, and more disk space consumed, but tests are getting discovered now. To illustrate the problem, run `bazel test //test/core/end2end:ping_pong_streaming_test --test_output=all --test_arg=--gtest_list_tests`. Before this PR, zero tests were found. Now:
```
CoreEnd2endTest.
PingPongStreaming1/Inproc
PingPongStreaming1/Chttp2FakeSecurityFullstack
PingPongStreaming1/Chttp2Fullstack
PingPongStreaming1/Chttp2FullstackCompression
PingPongStreaming1/Chttp2FullstackLocalIpv4
...
```
Closes #37918
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37918 from drfloob:fix-win-core-e2e e60f832a5d
PiperOrigin-RevId: 690741492
Also remove a now-unnecessary dependency from the dump_args target that
previously had been needed just to get the right includes to be used.
PiperOrigin-RevId: 690619490
This will resolve a bunch of warnings we're seeing in Envoy, of the form:
```
INFO: From Action external/com_github_grpc_grpc/src/proto/grpc/reflection/v1alpha/reflection.grpc.pb.h:
bazel-out/k8-opt/bin/external/com_github_grpc_grpc/external/com_github_grpc_grpc: warning: directory does not exist.
INFO: From Action external/envoy_api/envoy/service/ext_proc/v3/external_processor.grpc.pb.h:
bazel-out/k8-opt/bin/external/envoy_api/external/envoy_api: warning: directory does not exist.
INFO: From Action external/opencensus_proto/opencensus/proto/agent/trace/v1/trace_service.grpc.pb.h:
bazel-out/k8-opt/bin/external/opencensus_proto/external/opencensus_proto: warning: directory does not exist.
INFO: From Action external/com_google_googleapis/google/devtools/cloudtrace/v2/trace.grpc.pb.h:
bazel-out/k8-opt/bin/external/com_google_googleapis/external/com_google_googleapis: warning: directory does not exist.
INFO: From Action external/com_github_grpc_grpc/src/proto/grpc/reflection/v1/reflection.grpc.pb.h:
bazel-out/k8-opt/bin/external/com_github_grpc_grpc/external/com_github_grpc_grpc: warning: directory does not exist.
```
```
bazel-out/k8-opt/bin/external/com_github_grpc_grpc/external/com_github_grpc_grpc: warning: directory does not exist.
|----copy-1-from-dir_out-----||---copy-2-from-proto_root---|
```
Closes #37990
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37990 from asedeno:warnings 801da4c5cd
PiperOrigin-RevId: 689517449
It's not obvious to me what the reason for the increase in test time is, but it seems that we were already at ~3hr45min earlier and now we are timing out at 4 hours. Increasing to 6 hours.
Closes #37989
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37989 from yashykt:IncreaseWindowsTimeout a0daa97486
PiperOrigin-RevId: 689416681
This will disable the jobs in both bazel and cmake builds, which is necessary for our CI to remain happy. These end2end tests were enabled on Windows at some point without any notice (they had been intentionally disabled for a while), but the RBE jobs have been silently failing for 7 months, so it's unclear when that happened.
These failures still need to be examined so we can re-enable these tests.
Closes #37983
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37983 from drfloob:winguh 64a62fd5b9
PiperOrigin-RevId: 689118231
Possible fix for sporadic flakes like `detected corruption in /Volumes/BuildData/tmpfs/altsrc/github/grpc/workspace_ruby_macos_dbg_native/src/ruby/lib/grpc/grpc_c.bundle` we've been seeing in ruby.
`rake` (invoked by run_ruby.sh) "shouldn't" be modifying the binary if it's already been built (which is the case in these tests) but isn't really guaranteed not to. Running rspec directly ensures that we don't accidentally write to any pre-built artifacts while tests are possibly using them.
A side benefit here is to get better test reporting granularity.
Closes #37975
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37975 from apolcyn:ruby_flake_attempt c03931dfa4
PiperOrigin-RevId: 689064491
Closes #37973
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37973 from yijiem:core-end2end-windows-hack a2da4ae4eb
PiperOrigin-RevId: 688727248
The default versions to run tests on master using `run_tests.py` are usually set to the min and max supported Python versions. Looks like I missed updating the max version to Python 3.13 when adding support recently.
Closes #37945
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37945 from sreenithi:add_missing_py313_test_config 708aa70f2b
PiperOrigin-RevId: 688702425
Three problems:
1. We have an owning waker, but on the `Expire` path we never wake it, leading to calls being stranded until the pending timer runs out. Instead we now call Finish and have it always wake things up (slightly more expensive in the shutdown case, but not on the fast path).
2. Avoid a race condition whereby two threads could wake the same waker.
3. Don't add new requests to the pending queue after we've removed all requests.
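The third point can be illustrated with a queue that refuses new entries once it has been drained. This is only a sketch under assumed names (`PendingQueue`, `Push`, `DrainAll`, integer request ids); it is not the code touched by this PR.

```cpp
#include <mutex>
#include <vector>

// Hypothetical pending-request queue keyed by integer request ids.
class PendingQueue {
 public:
  // Returns false once the queue has been drained. The caller is then
  // responsible for failing the request instead of leaving it stranded.
  bool Push(int request_id) {
    std::lock_guard<std::mutex> lock(mu_);
    if (drained_) return false;
    pending_.push_back(request_id);
    return true;
  }

  // Removes and returns every pending request, and marks the queue as
  // drained so that later Push calls are refused.
  std::vector<int> DrainAll() {
    std::lock_guard<std::mutex> lock(mu_);
    drained_ = true;
    std::vector<int> removed;
    removed.swap(pending_);
    return removed;
  }

 private:
  std::mutex mu_;
  bool drained_ = false;
  std::vector<int> pending_;
};
```

Setting the flag and removing the entries under the same lock means no request can slip into the queue between the two steps.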
Closes #37972
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37972 from ctiller:flake-fightas-21 2bbd1cf667
PiperOrigin-RevId: 688310530
I hit this crash working on fork support but there's a chance it happens if the file descriptor becomes bad for some other reason.
Closes #37952
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37952 from eugeneo:listener-no-address-crash 1e2e82f8e6
PiperOrigin-RevId: 688238823
Skipping the `tests_aio.unit.channel_ready_test.TestChannelReady.channel_ready_blocked` unit test because it is flaky: the test sometimes does not time out and raise TimeoutError as expected.
This can be reverted once #37949 is investigated and fixed.
Closes #37948
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37948 from sreenithi:check_aio_channel_ready_test f48edea5f2
PiperOrigin-RevId: 688061272
This log can be hit under normal-ish circumstances, e.g. if the handshaker service fails to respond or is unreachable. For that reason, it should not be an error log.
Closes #37962
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37962 from matthewstevenson88:downgrade-log 36d4f006a0
PiperOrigin-RevId: 687986132
Porting from #37829.
This ensures that we wait to create the stream to the handshaker service until handshake frames arrive from the client. Without this change, a TCP connection to the ALTS server triggers the stream to the handshaker service to be created, even if no handshake frames have arrived from the client. This wastes resources and can potentially cause the ALTS server to freeze up, because there is a cap on the number of concurrent ALTS handshakes that a server can perform.
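The lazy creation described above can be sketched as follows. `LazyHandshakerStream` and its members are hypothetical names for illustration, and the `open_stream` callback stands in for whatever actually opens the stream to the handshaker service.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical wrapper that opens the handshaker stream on first use.
class LazyHandshakerStream {
 public:
  explicit LazyHandshakerStream(std::function<void()> open_stream)
      : open_stream_(std::move(open_stream)) {}

  // Called when bytes arrive from the client. The stream is opened only when
  // the first non-empty chunk shows up, so an idle TCP connection does not
  // count against the cap on concurrent handshakes.
  void OnBytesFromClient(const std::string& bytes) {
    if (bytes.empty()) return;
    if (!opened_) {
      open_stream_();
      opened_ = true;
    }
    // A real implementation would forward `bytes` to the handshaker here.
  }

  bool opened() const { return opened_; }

 private:
  std::function<void()> open_stream_;
  bool opened_ = false;
};
```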
Closes #37961
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37961 from matthewstevenson88:alts-fix f8f07e59bb
PiperOrigin-RevId: 687977457
This fixes connecting to a local ALTS server with grpc_cli.
Closes #37950
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37950 from rainwoodman:patch-1 1172618fb2
PiperOrigin-RevId: 687975188
Use the POSIX code like FreeBSD does.
Closes #37700
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37700 from 0-wiz-0:master 0846ff5ef0
PiperOrigin-RevId: 687539488
Closes #37901
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37901 from yijiem:dns-migration-chttp2-server-2 70d29b3b6c
PiperOrigin-RevId: 687466646
Don't complete writes of messages until they make it to the transport's outbound loop. Since payloads could be large, this introduces just enough pushback that, once #37868 also goes in, we should be able to sense when a transport is busy writing and stop sending at higher layers.
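As a rough illustration of the idea, the sketch below completes a write only when the write loop pulls the message, not when it is queued. `OutboundQueue`, `Push`, and `PullNext` are hypothetical names and do not correspond to gRPC's transport types.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical outbound queue sitting between senders and the write loop.
class OutboundQueue {
 public:
  // Queues a message. `on_write_complete` is not run here; it runs only when
  // the transport's write loop actually takes the message.
  void Push(std::string payload, std::function<void()> on_write_complete) {
    queue_.push_back(Entry{std::move(payload), std::move(on_write_complete)});
  }

  // Called from the transport's outbound loop. Returns false if nothing is
  // queued; otherwise hands over the payload and completes the write.
  bool PullNext(std::string* payload) {
    if (queue_.empty()) return false;
    Entry entry = std::move(queue_.front());
    queue_.pop_front();
    *payload = std::move(entry.payload);
    if (entry.on_write_complete) entry.on_write_complete();
    return true;
  }

 private:
  struct Entry {
    std::string payload;
    std::function<void()> on_write_complete;
  };
  std::deque<Entry> queue_;
};
```

A sender that waits for `on_write_complete` before pushing its next message is naturally slowed down whenever the write loop falls behind.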
Closes #37894
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37894 from ctiller:send-acked 0cb3d7f8ad
PiperOrigin-RevId: 686689473
This is missing in v3 compared with v2: in v2 we had Pipe set up so that multiple Pipe stages could be chained and only complete when the last stage had passed flow control, whereas in v3 the top stage will start accepting requests as soon as the first stage in the pipeline takes the message.
Closes #37868
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37868 from ctiller:drizzling 69209da8a7
PiperOrigin-RevId: 686652402
Passed interop test:
- [x] [grpc/core/master/linux/psm-csm-python](https://source.cloud.google.com/results/invocations/b9ba256b-31a9-4002-bd59-b21817aa9978)
Closes #37837
PiperOrigin-RevId: 686643728
Currently the destructive reclaimer cancels existing requests on a single thread, but we admit new RPCs on every channel (to be eventually cancelled, probably).
We've got evidence that this (shockingly) doesn't scale, and senders can easily overwhelm and OOM a server.
Instead under this experiment now we'll always reject new work under very high load, and allow the reclaimer to mop up any remaining work to get back to within bounds.
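The reject-at-admission idea can be sketched as below. The message does not say how load is measured, so the byte budget here is an assumption, and `AdmissionGate`, `TryAdmit`, and `Release` are hypothetical names rather than gRPC APIs.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>

// Hypothetical admission gate with a fixed byte budget (an assumption made
// for this sketch; the real experiment may measure load differently).
class AdmissionGate {
 public:
  explicit AdmissionGate(std::size_t limit_bytes) : limit_bytes_(limit_bytes) {}

  // Rejects new work immediately when it would exceed the budget, instead of
  // admitting it and relying on a later cancellation.
  bool TryAdmit(std::size_t cost_bytes) {
    std::lock_guard<std::mutex> lock(mu_);
    if (cost_bytes > limit_bytes_ - in_use_bytes_) return false;
    in_use_bytes_ += cost_bytes;
    return true;
  }

  // Returns budget when admitted work finishes or is reclaimed.
  void Release(std::size_t cost_bytes) {
    std::lock_guard<std::mutex> lock(mu_);
    in_use_bytes_ -= std::min(cost_bytes, in_use_bytes_);
  }

 private:
  std::mutex mu_;
  const std::size_t limit_bytes_;
  std::size_t in_use_bytes_ = 0;
};
```

Rejecting at the gate keeps the cost of an overload proportional to the number of rejected requests, rather than to the work needed to cancel them after admission.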
Closes #37927
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/37927 from ctiller:fast_reject 835726473a
PiperOrigin-RevId: 686553599