Summary -
On the server-side, we are changing the point at which we decide whether
a method is registered or not from the surface to the transport at the
point where we are done receiving initial metadata and before we invoke
the recv_initial_metadata_ready closures from the filters. The main
motivation for this is to allow filters to check whether the incoming
method is registered or not. The exact use-case is for observability
where we only want to record the method if it is registered. We store
the information about the registered method in the initial metadata.
On the client-side, we also set information about whether the method is
registered or not in the outgoing initial metadata.
Since we are effectively changing the lookup point of the registered
method, there are slight concerns of this being a potentially breaking
change, so we are guarding this with an experiment to be safe.
Changes -
* Transport API changes -
* Along with `accept_stream_fn`, a new callback
`registered_method_matcher_cb` will be sent down as a transport op on
the server side. When initial metadata is received on the server side,
this callback is invoked. This happens before invoking the
`recv_initial_metadata_ready` closure.
* Metadata changes -
* We add a new non-serializable metadata trait `GrpcRegisteredMethod()`.
On the client-side, the value is a `uintptr_t` with a value of 1 if the
call has a registered/known method, or 0 if it does not. On the
server side, the value is a `ChannelRegisteredMethod*`. This metadata
information can be used throughout the stack to check whether a call is
registered or not.
* Server Changes -
* When a new transport connection is accepted, the server sets
`registered_method_matcher_cb` along with `accept_stream_fn`. This
function checks whether the method is registered or not and sets the
RegisteredMethod matcher in the metadata for use later.
* Client Changes -
* Set the metadata on call creation on whether the method is registered
or not.
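The ordering described above (match the registered method in the transport, then run the filters' closures) can be sketched roughly as follows. This is a simplified stand-alone model, not the actual transport code: `Metadata`, `RegisteredMethod`, `ServerTransport`, `OnInitialMetadataReceived` and `FilterSeesRegisteredMethod` are hypothetical names, and the method paths are made up. Only the callback name and the ordering come from the description.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a server-side registered method entry.
struct RegisteredMethod {
  std::string path;
};

// Hypothetical stand-in for initial metadata. `registered_method` models the
// non-serializable GrpcRegisteredMethod() trait as seen on the server side.
struct Metadata {
  std::string path;
  const RegisteredMethod* registered_method = nullptr;
};

// Hypothetical transport that runs the matcher before any filter closure.
class ServerTransport {
 public:
  std::function<void(Metadata&)> registered_method_matcher_cb;
  std::vector<std::function<void(const Metadata&)>> recv_initial_metadata_ready;

  void OnInitialMetadataReceived(Metadata& md) {
    // Step 1: resolve the registered method at the transport level.
    if (registered_method_matcher_cb) registered_method_matcher_cb(md);
    // Step 2: only then run the filters' closures, which can now see it.
    for (auto& closure : recv_initial_metadata_ready) closure(md);
  }
};

// Runs one call through the model and reports whether an observability-style
// filter saw a registered method for the given path.
bool FilterSeesRegisteredMethod(const std::string& path) {
  static const std::map<std::string, RegisteredMethod> kRegistry = {
      {"/pkg.Service/Known", RegisteredMethod{"/pkg.Service/Known"}}};
  ServerTransport transport;
  transport.registered_method_matcher_cb = [](Metadata& md) {
    auto it = kRegistry.find(md.path);
    md.registered_method = it == kRegistry.end() ? nullptr : &it->second;
  };
  bool seen = false;
  transport.recv_initial_metadata_ready.push_back(
      [&seen](const Metadata& md) { seen = md.registered_method != nullptr; });
  Metadata md;
  md.path = path;
  transport.OnInitialMetadataReceived(md);
  return seen;
}
```

In this model, a filter that only records registered methods would check `registered_method` inside its closure and skip the call when it is null.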
Bump gcc 7 to 8 to work around the ongoing gcc segfault when building
Protobuf C++. Foundational C++ currently requires gcc 7, so this is a
temporary measure to make the test green. Soon, we need to either decide
to change the minimum gcc version in Foundational C++ or find a way to
support gcc 7 without the segfault.
Changes -
* CsmObservability doesn't need `SetTargetSelector`. Removed it.
* Added missing plumbing of `ServiceMeshLabelsInjector` in
`CsmObservability` to actually do the metadata exchange.
Changes -
* Use `grpc-c++` as the meter name and the proper version string when
creating the meter for OTel
* Set the metric description and unit. (Pointed out by @DNVindhya)
Lets us sever the dependency between stats & exec ctx (finally).
More work likely needs to go into the *mechanism* used here (I'm not a
fan of the per thread index), but that's also something we can address
later.
Support Python 3.12.
### Testing
* Passed all Distribution Tests.
* Also tested locally by installing 3.12 artifact.
The `cancel_check_peer()` method is [always called with a non-OK
status](866fc41067/src/core/lib/security/transport/security_handshaker.cc (L560)),
since it's used only in cancellation cases. However, the implementation
of this method for TLS creds was bailing out if the status was non-OK,
meaning that `cancel_check_peer()` was never actually cancelling the
verification request. This bug seems to have been introduced back in
#25631, when the method was initially implemented.
I don't think we actually have any async verifier implementations today,
so this isn't actually causing a problem. I discovered this bug as part
of #34426, which was triggering the core e2e `no_logging` test to fail.
That test is designed to ensure that we don't generate any logs while
processing individual RPCs, since that would be bad for performance and
would flood logfiles. My PR caused a connection attempt to be cancelled
during the test, which triggered the error log that I am removing in
this PR.
Note that with this PR, the TLS creds `cancel_check_peer()` methods are
not actually doing anything with the status. Ideally, they should be
passing the status through to the verifier's `Cancel()` method, but we
apparently didn't add a parameter for that, which means that although
cancellation will work now, it will not properly pass through the right
error message. At some point, we should fix this and add tests covering
cancellation of async verifier requests to prove that the error message
is propagated correctly.
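The shape of the bug can be sketched roughly like this. It is a simplified model, not the real TLS creds code: `Status`, `Verifier`, `CancelCheckPeerBuggy` and `CancelCheckPeerFixed` are hypothetical stand-ins, and the removed error log is left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a status value.
struct Status {
  bool ok;
  std::string message;
};

// Hypothetical verifier. Cancel() takes no status, mirroring the missing
// parameter mentioned above.
struct Verifier {
  int cancel_calls = 0;
  void Cancel() { ++cancel_calls; }
};

// Buggy shape: bails out on a non-OK status. Because callers only ever pass
// a non-OK status, Cancel() is never reached.
void CancelCheckPeerBuggy(Verifier& verifier, const Status& status) {
  if (!status.ok) return;
  verifier.Cancel();
}

// Fixed shape: always cancels. The status is currently unused because
// Cancel() has no way to receive it.
void CancelCheckPeerFixed(Verifier& verifier, const Status& /*status*/) {
  verifier.Cancel();
}
```

A test for async verifier cancellation would presumably assert that the verifier observes exactly one cancellation for a non-OK status, which the buggy shape fails.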
Bazelify tests from "linux/grpc_bazel_build" kokoro job by creating 3
bazelified tests - "build with strict warning", "build with no_xds=True"
and "build with no_xds=True negative test".
- also make the original "linux/grpc_bazel_build" kokoro job a no-op
(since bazelified tests now provide the same coverage).
The deleted code here was overriding the
[intended](866fc41067/tools/run_tests/run_tests.py (L62))
default test env of `GRPC_VERBOSITY=DEBUG`.
I'm just deleting it because it looks like `GRPC_TRACE=api` is not having
any effect anyway, since it relies on `GRPC_VERBOSITY=DEBUG`, which the
deleted code happens to be unsetting.
If the client calls `LookupHostname` again from within the `on_resolve`
callback, it tries to re-acquire `request_mu_` before the lock has been
released, which results in a deadlock.
With this PR, the request is extracted and the lock is released before
the `on_resolve` callback is called, so it no longer deadlocks.
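A minimal sketch of the "extract, unlock, then call back" pattern, assuming a much simpler resolver than the real one. The `Resolver` class, `Request` struct, `Complete()` and `PendingCount()` are hypothetical; only `LookupHostname`, `on_resolve` and `request_mu_` are names from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical resolver that tracks pending requests under request_mu_.
class Resolver {
 public:
  using OnResolve = std::function<void(const std::string&)>;

  int LookupHostname(std::string name, OnResolve on_resolve) {
    std::lock_guard<std::mutex> lock(request_mu_);
    const int id = next_id_++;
    requests_[id] = Request{std::move(name), std::move(on_resolve)};
    return id;
  }

  // Completes a request. The request is moved out of the map and the lock is
  // released before the callback runs, so the callback may call
  // LookupHostname() again without deadlocking on request_mu_.
  void Complete(int id) {
    Request request;
    {
      std::lock_guard<std::mutex> lock(request_mu_);
      auto it = requests_.find(id);
      if (it == requests_.end()) return;
      request = std::move(it->second);
      requests_.erase(it);
    }  // request_mu_ is released here.
    request.on_resolve(request.name);
  }

  std::size_t PendingCount() {
    std::lock_guard<std::mutex> lock(request_mu_);
    return requests_.size();
  }

 private:
  struct Request {
    std::string name;
    OnResolve on_resolve;
  };

  std::mutex request_mu_;
  int next_id_ = 0;
  std::map<int, Request> requests_;
};
```

If the callback were invoked inside the locked scope instead, a nested `LookupHostname()` call would try to lock a non-recursive mutex that the same thread already holds.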
https://github.com/grpc/grpc/pull/33538 added `-weak_framework
CoreFoundation` in `DLDFLAGS` for only `arm64-darwin` builds, but the
issue reported in https://github.com/grpc/grpc/issues/33483 can also
happen on `x86-darwin` builds. This can happen if the Ruby interpreter
is compiled without `-Wl,-undefined,dynamic_lookup`, which is the case
when the interpreter is built with Xcode 14.0 to 14.2
(https://bugs.ruby-lang.org/issues/19005).
Simplify the logic and always include `-weak_framework CoreFoundation`
for macOS builds.
Changes -
* Remove `csm.remote_workload_pod_name` and
`csm.remote_workload_container_name`.
* Add `csm.remote_workload_name`, the value for which is sent through
MetadataExchange, from the `CSM_WORKLOAD_NAME` env var. (Note that this
is not added in local labels.)
* Add a local `csm.canonical_service` (@markdroth, please verify the key
that we want here) that is read from `CSM_CANONICAL_SERVICE_NAME` env
var, and we continue to send it over via MetadataExchange
- make C-core basictests use `--build_only` when running as bazelified
tests. This is because the volume of C core tests is expected to grow
very significantly after https://github.com/grpc/grpc/pull/34419 and
currently the non-bazelified counterpart of the tests (the presubmit
grpc_basictests_c_cpp_build_only job) is also "build only".
- make the linux presubmit job `grpc_basictests_c_cpp_build_only` a
noop, since the bazelified tests already give the same coverage on
presubmit.
Revert the reversion of the SSL_CTX_new change (#34355 reverted #34180 )
with a fix.
There was an issue with using `strcpy` on a `new[]`-allocated string in
the constructor of `ssl_credentials`. An ASAN test caught this in some CI
down the line: `ERROR: AddressSanitizer: alloc-dealloc-mismatch
(operator new [] vs free)`.
That `strcpy` call was changed to `gpr_strdup`, which duplicates a string
in a way that can be freed by `gpr_free` and should resolve the ASAN
failure.
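To illustrate the allocator pairing, here is a small sketch that uses plain `malloc`/`free` as a stand-in for the gpr allocation functions. `DupForFree` is a hypothetical helper written for this example, not the real `gpr_strdup`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical strdup-style helper: allocates with malloc so that the result
// can be released with free(), keeping allocation and deallocation paired.
char* DupForFree(const char* src) {
  if (src == nullptr) return nullptr;
  const std::size_t len = std::strlen(src) + 1;  // include the terminator
  char* dst = static_cast<char*>(std::malloc(len));
  if (dst != nullptr) std::memcpy(dst, src, len);
  return dst;
}

// Mismatched pattern, shown only as a comment because it is the bug:
//   char* s = new char[len];
//   strcpy(s, src);
//   free(s);  // ASAN: alloc-dealloc-mismatch (operator new [] vs free)
```

The point is only that a buffer must be released by the deallocator matching its allocator; copying into a `new[]` buffer is fine as long as it is later released with `delete[]`.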
Added a separate distribtest for the gRPC C++ DLL build on Windows. The
DLL build is community-supported, so it should run independently of the
existing Windows distribtests. The actual DLL test will be added later.
We're seeing some reports of the ping abuse policy not working like it
ought... add some tracing here to debug.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
This pull request adds another hook service on the maintenance server.
This will enable clients to gradually migrate from the standalone hook
server.
Changes:
1. Hook service can now be used separately.
2. Copied latest protos and updated the hook service to new API.
3. Added the hook service to the maintenance server.
Working towards testing against CSM Observability. Added the ability to
register a Prometheus exporter with our OpenTelemetry plugin. This will
allow our metrics to be available at the standard Prometheus port
`:9464`.
This fixes a deadlock seen when both the
`round_robin_delegate_to_pick_first` and
`client_channel_subchannel_wrapper_work_serializer_orphan` experiments
are enabled -- although I think the bug really has to do only with the
latter.
The problem here was that we were unreffing the picker while holding the
channel's LB mutex. Destroying the picker was triggering a
`SubchannelWrapper` to be orphaned, which triggered a hop into the
`WorkSerializer`. Once there, we were also running a queued subchannel
connectivity state notification, which triggered an update to the LB
policy, which triggered returning a new picker to the channel, which
tried to acquire the channel's LB mutex again.
Note that the `work_serializer_dispatch` experiment would have avoided
this problem.
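The general remedy for this kind of re-entrancy (swap the old object out under the lock, destroy it after unlocking) can be sketched as below. This is an assumption about the shape of the fix, not a copy of it: `Channel`, `Picker`, `SetPicker` and `OnPickerDestroyed` are hypothetical, and the real chain through `SubchannelWrapper` and `WorkSerializer` is collapsed into a destructor that re-enters the channel.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>

class Channel;

// Hypothetical picker whose destruction re-enters the channel, standing in
// for the orphan -> WorkSerializer -> new picker chain described above.
class Picker {
 public:
  explicit Picker(Channel* channel) : channel_(channel) {}
  ~Picker();

 private:
  Channel* channel_;
};

class Channel {
 public:
  ~Channel() { SetPicker(nullptr); }

  void SetPicker(std::unique_ptr<Picker> picker) {
    std::unique_ptr<Picker> old;
    {
      std::lock_guard<std::mutex> lock(lb_mu_);
      old = std::move(picker_);
      picker_ = std::move(picker);
    }  // lb_mu_ is released here.
    // `old` is destroyed when this function returns, outside the lock, so
    // the re-entrant call from ~Picker() can take lb_mu_ safely.
  }

  // Called from ~Picker(); needs lb_mu_, like the picker update above.
  void OnPickerDestroyed() {
    std::lock_guard<std::mutex> lock(lb_mu_);
    ++destroyed_pickers_;
  }

  int destroyed_pickers() {
    std::lock_guard<std::mutex> lock(lb_mu_);
    return destroyed_pickers_;
  }

 private:
  std::mutex lb_mu_;
  int destroyed_pickers_ = 0;
  std::unique_ptr<Picker> picker_;
};

Picker::~Picker() { channel_->OnPickerDestroyed(); }
```

Had `old` been declared inside the locked scope, the destructor would run while `lb_mu_` is still held and the re-entrant lock attempt would hang.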
grpc 1.57.0 crashes Ruby on Windows and Alpine due to no `strdup` in
musl libc. This diff replaces `strdup` with `gpr_strdup`.
```
Thread 1 "ruby" received signal SIGSEGV, Segmentation fault.
0x00000000000a4596 in ?? ()
(gdb) bt
#0 0x00000000000a4596 in ?? ()
#1 0x00007ffff14e298c in grpc_rb_channel_create_in_process_add_args_hash_cb (key=<optimized out>, val=<optimized out>, args_obj=<optimized out>) at rb_channel_args.c:84
#2 0x00007ffff7c2b9ea in hash_ar_foreach_iter (error=0, argp=140737488344784, value=<optimized out>, key=<optimized out>) at hash.c:1341
```
fixes #34044
closes #27995
Since many tests now run reliably as bazelified tests on RBE, we can
remove them from presubmit runs to speed up testing of PRs.
(For now, these jobs will still run on master; they can be removed from
master as a follow-up.)
- linux/grpc_distribtests_standalone is now fully covered by bazel test
suite
a3b4c797a7/tools/bazelify_tests/test/BUILD (L202),
setting them to `presubmit=False` will stop tests from running on PRs.
- stop running tests from grpc_bazel_distribtest on PR, instead rely on
bazel distribtests running as bazelified tests.
We have a bunch of experiments testing against core e2e - and this is
good for robustness, bad for CI times.
We also have a bunch of marginal but overall necessary fixtures in the
e2e suites - again good for robustness, bad for CI times.
We can eliminate some of the cross product though, and I think safely:
run experiments on a broad range of suites, but not *ALL* the suites,
and get a bunch of our CI time back.
Here I introduce an environment variable: `GRPC_CI_EXPERIMENTS` that's
set when running bazel @experiment= configs, cleared otherwise (so we
can still execute those tests directly when necessary). When that env
var is set we filter out a bunch of suites from the test configurations.
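As a rough illustration of the filtering, assuming that "set" simply means the variable is present in the environment: the function names and the suite names used with them are hypothetical, and only `GRPC_CI_EXPERIMENTS` comes from the description.

```cpp
#include <cassert>
#include <cstdlib>
#include <set>
#include <string>
#include <vector>

// Assumption: any value, including an empty one, counts as "set".
bool RunningUnderCiExperiments() {
  return std::getenv("GRPC_CI_EXPERIMENTS") != nullptr;
}

// Returns the suites to run. Suites listed in `skipped_under_experiments`
// are dropped only when `ci_experiments` is true; otherwise everything runs,
// so the tests can still be executed directly when necessary.
std::vector<std::string> SelectSuites(
    const std::vector<std::string>& all_suites,
    const std::set<std::string>& skipped_under_experiments,
    bool ci_experiments) {
  std::vector<std::string> selected;
  for (const std::string& suite : all_suites) {
    if (ci_experiments && skipped_under_experiments.count(suite) > 0) continue;
    selected.push_back(suite);
  }
  return selected;
}
```

A caller would pass `RunningUnderCiExperiments()` as the last argument; keeping the environment lookup separate from the filtering makes the filter easy to test.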
This is just an initial scope of tests. Much of this code was written by
@ginayeh . I just did the final polish/integration step.
There are 3 main tests included:
1. The GAMMA baseline test, including the [actual GAMMA
API](https://gateway-api.sigs.k8s.io/geps/gep-1426/) rather than vendor
extensions.
2. Kubernetes-based stateful session affinity tests, where the mesh
(including SSA configuration) is configured using CRDs
3. GCP-based stateful session affinity tests, where the mesh is
configured using the networkservices APIs directly
Tests 1 and 2 will run in both prod and GKE staging, i.e.
`container.googleapis.com` and
`staging-container.sandbox.googleapis.com`. The latter of these will act
as an early detection mechanism for regressions in the controller that
translates Gateway resources into networkservices resources.
Test 3 will run against `staging-networkservices.sandbox.googleapis.com`
to act as an early detection mechanism for regressions in the control
plane SSA implementation.
The scope of the SSA tests is still fairly minimal. Session drain
testing is in-progress but not included in this PR, though several
elements required for it are (grace period, pre-stop hook, and the
ability to kill a single pod in a deployment).
---------
Co-authored-by: Jung-Yu (Gina) Yeh <ginayeh@google.com>
Co-authored-by: Sergii Tkachenko <sergiitk@google.com>
Add some basic metrics to work serializer, keep them process wide for
now (though it may be interesting to get these into channelz in the
future).
Collected are:
- time spent running a work serializer when it starts
- time spent actually executing work when the work serializer runs
- number of items executed each run
A high disparity between the first two indicates our dispatching
mechanism is adding large amounts of latency (perhaps due to
thread-starvation-like effects).
A high value for any of these indicates contention on the serializer.
It's likely a future iteration on these will select different metrics -
I'm not *entirely* sure which will be useful in production analysis yet.
I'm using `std::chrono::steady_clock` here for precision (nanoseconds)
with a compact representation (better than timespec) and a robust &
portable api - I think it's appropriate for metrics, but wouldn't use it
much beyond that at this point.
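A single-threaded sketch of how the three numbers could be collected with `std::chrono::steady_clock`. This is not the work serializer itself: `MeteredQueue` and `RunMetrics` are hypothetical, and measuring the first metric from the moment the run was requested is my reading of the list above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical per-run metrics record.
struct RunMetrics {
  Clock::duration run_time{};   // from run requested until the queue drained
  Clock::duration work_time{};  // time spent inside the callbacks only
  std::size_t items = 0;        // number of callbacks executed in this run
};

// Hypothetical single-threaded queue that times each drain.
class MeteredQueue {
 public:
  void Schedule(std::function<void()> fn) {
    // The first item of an idle queue marks the start of a run.
    if (!running_ && queue_.empty()) run_requested_at_ = Clock::now();
    queue_.push_back(std::move(fn));
  }

  RunMetrics Drain() {
    RunMetrics metrics;
    if (queue_.empty()) return metrics;
    running_ = true;
    while (!queue_.empty()) {
      std::function<void()> fn = std::move(queue_.front());
      queue_.pop_front();
      const Clock::time_point start = Clock::now();
      fn();
      metrics.work_time += Clock::now() - start;
      ++metrics.items;
    }
    running_ = false;
    metrics.run_time = Clock::now() - run_requested_at_;
    return metrics;
  }

 private:
  std::deque<std::function<void()>> queue_;
  Clock::time_point run_requested_at_{};
  bool running_ = false;
};
```

The gap between `run_time` and `work_time` is the dispatch overhead in this model, which corresponds to the disparity discussed above.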
The one in xds_override_host was the one that was actually triggering
test failures, but I audited all of the other policies and fixed a
couple of other places that could also be problematic.
This test assumed synchronous work serializer execution (or at least
faster async than we always get)... make a trivial change to keep the
test semantics but allow for the implementation to be more async.