The `cancel_check_peer()` method is [always called with a non-OK
status](866fc41067/src/core/lib/security/transport/security_handshaker.cc (L560)),
since it's used only in cancellation cases. However, the implementation
of this method for TLS creds was bailing out if the status was non-OK,
meaning that `cancel_check_peer()` was never actually cancelling the
verification request. This bug seems to have been introduced back in
#25631, when the method was initially implemented.
I don't think we actually have any async verifier implementations today,
so this isn't actually causing a problem. I discovered this bug as part
of #34426, which was triggering the core e2e `no_logging` test to fail.
That test is designed to ensure that we don't generate any logs while
processing individual RPCs, since that would be bad for performance and
would flood logfiles. My PR caused a connection attempt to be cancelled
during the test, which triggered the error log that I am removing in
this PR.
Note that with this PR, the TLS creds `cancel_check_peer()` methods are
not actually doing anything with the status. Ideally, they should be
passing the status through to the verifier's `Cancel()` method, but we
apparently didn't add a parameter for that, which means that although
cancellation will work now, it will not properly pass through the right
error message. At some point, we should fix this and add tests covering
cancellation of async verifier requests to prove that the error message
is propagated correctly.
Bazelify tests from "linux/grpc_bazel_build" kokoro job by creating 3
bazelified tests - "build with strict warning", "build with no_xds=True"
and "build with no_xds=True negative test".
- also make the original "linux/grpc_bazel_build" kokoro job a no-op
(since bazelified tests now provide the same coverage).
The deleted code here was overriding the
[intended](866fc41067/tools/run_tests/run_tests.py (L62))
default test env of `GRPC_VERBOSITY=DEBUG`.
I'm just deleting it because it looks like`GRPC_TRACE=api` is not having
any affect anyways, since it relies on `GRPC_VERBOSITY=DEBUG` which it
happens to be unsetting.
If the client calls LookupHostname again within the on_resolve callback,
it re-acquires the `request_mu_` before releasing it which results in
deadlock.
With this PR it extracts the request and releases the lock before
calling on_resolve callback so it won't deadlock any more.
https://github.com/grpc/grpc/pull/33538 added `-weak_framework
CoreFoundation` in `DLDFLAGS` for only `arm64-darwin` builds, but the
issue reported in https://github.com/grpc/grpc/issues/33483 can also
happen on `x86-darwin` builds. This can happen if:
1. The Ruby interpreter is compiled without
`-Wl,-undefined,dynamic_lookup`.
2. This happens if the Ruby interpreter is built with XCode 14.0 to 14.2
(https://bugs.ruby-lang.org/issues/19005).
Simplify the logic and always include `-weak_framework CoreFoundation`
for macOS builds.
Changes -
* Remove `csm.remote_workload_pod_name` and
`csm.remote_workload_container_name`.
* Add `csm.remote_workload_name`, the value for which is sent through
MetadataExchange, from the `CSM_WORKLOAD_NAME` env var. (Note that this
is not added in local labels.)
* Add a local `csm.canonical_service` (@markdroth, please verify the key
that we want here) that is read from `CSM_CANONICAL_SERVICE_NAME` env
var, and we continue to send it over via MetadataExchange
- make C-core basictests use `--build_only` when running as bazelified
tests. This is because the volume of C core tests is expected to grow
very significantly after https://github.com/grpc/grpc/pull/34419 and
currently the non-bazelified counterpart of the tests (the presubmit
grpc_basictests_c_cpp_build_only job) is also "build only".
- make the linux presubmit job `grpc_basictests_c_cpp_build_only` a
noop, since the bazelified tests already give the same coverage on
presubmit.
Revert the reversion of the SSL_CTX_new change (#34355 reverted #34180 )
with a fix.
There was an issue with using `strcpy` on a `new[] string` in the
constructor of `ssl_credentials`. An ASAN test caught this in some CI
down the line - `ERROR: AddressSanitizer: alloc-dealloc-mismatch
(operator new [] vs free)`
That `strcpy` call was changed to `grp_strdup` which duplicates a string
in a way that can be freed by `gpr_free` and should resolve the ASAN
failure.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Added a separate distribtests for gRPC C++ DLL build on Windows. This
DLL build is a community support so it should be independently run from
the existing Windows distribtests. Actual DLL test will be added.
We're seeing some reports of the ping abuse policy not working like it
ought... add some tracing here to debug.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
This pull request adds another hook service on the maintenance server.
This will enable clients to gradually migrate from the standalone hook
server.
Changes:
1. Hook service can now be used separately.
2. Copied latest protos and updated the hook service to new API.
3. Added the hook service to the maintenance server.
Working towards testing against CSM Observability. Added ability to
register a prometheus exporter with our Opentelemetry plugin. This will
allow our metrics to be available at the standard prometheus port
`:9464`.
This fixes a deadlock seen when both the
`round_robin_delegate_to_pick_first` and
`client_channel_subchannel_wrapper_work_serializer_orphan` experiments
are enabled -- although I think the bug really has to do only with the
latter.
The problem here was that we were unreffing the picker while holding the
channel's LB mutex. Destroying the picker was triggering a
`SubchannelWrapper` to be orphaned, which triggered a hop into the
`WorkSerializer`. Once there, we were also running a queued subchannel
connectivity state notification, which triggered an update to the LB
policy, which triggered returning a new picker to the channel, which
tried to acquire the channel's LB mutex again.
Note that the `work_serializer_dispatch` experiment would have avoided
this problem.
grpc 1.57.0 crashes win ruby and alpine due to no `strdup` in musl libc.
This diff replace `strdup` with `grp_strdup`
```
Thread 1 "ruby" received signal SIGSEGV, Segmentation fault.
0x00000000000a4596 in ?? ()
(gdb) bt
#0 0x00000000000a4596 in ?? ()
#1 0x00007ffff14e298c in grpc_rb_channel_create_in_process_add_args_hash_cb (key=<optimized out>, val=<optimized out>, args_obj=<optimized out>) at rb_channel_args.c:84
#2 0x00007ffff7c2b9ea in hash_ar_foreach_iter (error=0, argp=140737488344784, value=<optimized out>, key=<optimized out>) at hash.c:1341
```
fixes#34044closes#27995
Since many tests now run reliably as bazelified tests on RBE, we can
remove them from presubmit runs
to speedup testing of PRs.
(for now, these jobs will still run on master, they can be removed from
master as a followup).
- linux/grpc_distribtests_standalone is now fully covered by bazel test
suite
a3b4c797a7/tools/bazelify_tests/test/BUILD (L202),
setting them to `presubmit=False` will stop tests from running on PRs.
- stop running tests from grpc_bazel_distribtest on PR, instead rely on
bazel distribtests running as bazelified tests.
We have a bunch of experiments testing against core e2e - and this is
good for robustness, bad for CI times.
We also have a bunch of marginal but overall necessary fixtures in the
e2e suites - again good for robustness, bad for CI times.
We can eliminate some of the cross product though, and I think safely:
run experiments on a broad range of suites, but not *ALL* the suites,
and get a bunch of our CI time back.
Here I introduce an environment variable: `GRPC_CI_EXPERIMENTS` that's
set when running bazel @experiment= configs, cleared otherwise (so we
can still execute those tests directly when necessary). When that env
var is set we filter out a bunch of suites from the test configurations.
This is just an initial scope of tests. Much of this code was written by
@ginayeh . I just did the final polish/integration step.
There are 3 main tests included:
1. The GAMMA baseline test, including the [actual GAMMA
API](https://gateway-api.sigs.k8s.io/geps/gep-1426/) rather than vendor
extensions.
2. Kubernetes-based stateful session affinity tests, where the mesh
(including SSA configuration) is configured using CRDs
3. GCP-based stateful session affinity tests, where the mesh is
configured using the networkservices APIs directly
Tests 1 and 2 will run in both prod and GKE staging, i.e.
`container.googleapis.com` and
`staging-container.sandbox.googleapis.com`. The latter of these will act
as an early detection mechanism for regressions in the controller that
translates Gateway resources into networkservices resources.
Test 3 will run against `staging-networkservices.sandbox.googleapis.com`
to act as an early detection mechanism for regressions in the control
plane SSA implementation.
The scope of the SSA tests is still fairly minimal. Session drain
testing is in-progress but not included in this PR, though several
elements required for it are (grace period, pre-stop hook, and the
ability to kill a single pod in a deployment).
---------
Co-authored-by: Jung-Yu (Gina) Yeh <ginayeh@google.com>
Co-authored-by: Sergii Tkachenko <sergiitk@google.com>
Add some basic metrics to work serializer, keep them process wide for
now (though it may be interesting to get these into channelz in the
future).
Collected are:
- time spent running a work serializer when it starts
- time spent actually executing work when the work serializer runs
- number of items executed each run
A high disparity between the first two indicates our dispatching
mechanism is adding large amounts of latency (perhaps due to thread
starvation like effects).
A high value for any of these indicate contention on the serializer.
It's likely a future iteration on these will select different metrics -
I'm not *entirely* sure which will be useful in production analysis yet.
I'm using `std::chrono::steady_clock` here for precision (nanoseconds)
with a compact representation (better than timespec) and a robust &
portable api - I think it's appropriate for metrics, but wouldn't use it
much beyond that at this point.
The one in xds_override_host was the one that was actually triggering
test failures, but I audited all of the other policies and fixed a
couple of other places that could also be problematic.
This test assumed synchronous work serializer execution (or at least
faster async than we always get)... make a trivial change to keep the
test semantics but allow for the implementation to be more async.
This reverts commit 2db446aa9a.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Original PR was #34307, reverted in #34318 due to internal test
failures.
The first commit is a revert of the revert. The second commit contains
the fix.
The original idea here was that `SubchannelWrapper::Orphan()`, which is
called when the strong refcount reaches 0, would take a new weak ref and
then hop into the `WorkSerializer` before dropping that weak ref, thus
ensuring that the `SubchannelWrapper` is destroyed inside the
`WorkSerializer` (which is needed because the `SubchannelWrapper` dtor
cleans up some state in the channel related to the subchannel). The
problem is that `DualRefCounted<>::Unref()` itself actually increments
the weak ref count before calling `Orphan()` and then decrements it
afterwards. So in the case where the `SubchannelWrapper` is unreffed
outside of the `WorkSerializer` and no other thread happens to be
holding the `WorkSerializer`, the weak ref that we were taking in
`Orphan()` was unreffed inline, which meant that it wasn't actually the
last weak ref -- the last weak ref was the one taken by
`DualRefCounted<>::Unref()`, and it wasn't released until after the
`WorkSerializer` was released.
To this this problem, we move the code from the `SubchannelWrapper` dtor
that cleans up the channel's state into the `WorkSerializer` callback
that is scheduled in `Orphan()`. Thus, regardless of whether or not the
last weak ref is released inside of the `WorkSerializer`, we are
definitely doing that cleanup inside the `WorkSerializer`, which is what
we actually care about.
Also adds an experiment to guard this behavior.