Since many tests now run reliably as bazelified tests on RBE, we can
remove them from presubmit runs
to speedup testing of PRs.
(for now, these jobs will still run on master, they can be removed from
master as a followup).
- linux/grpc_distribtests_standalone is now fully covered by bazel test
suite
a3b4c797a7/tools/bazelify_tests/test/BUILD (L202),
setting them to `presubmit=False` will stop tests from running on PRs.
- stop running tests from grpc_bazel_distribtest on PR, instead rely on
bazel distribtests running as bazelified tests.
We have a bunch of experiments testing against core e2e - and this is
good for robustness, bad for CI times.
We also have a bunch of marginal but overall necessary fixtures in the
e2e suites - again good for robustness, bad for CI times.
We can eliminate some of the cross product though, and I think safely:
run experiments on a broad range of suites, but not *ALL* the suites,
and get a bunch of our CI time back.
Here I introduce an environment variable: `GRPC_CI_EXPERIMENTS` that's
set when running bazel @experiment= configs, cleared otherwise (so we
can still execute those tests directly when necessary). When that env
var is set we filter out a bunch of suites from the test configurations.
This is just an initial scope of tests. Much of this code was written by
@ginayeh . I just did the final polish/integration step.
There are 3 main tests included:
1. The GAMMA baseline test, including the [actual GAMMA
API](https://gateway-api.sigs.k8s.io/geps/gep-1426/) rather than vendor
extensions.
2. Kubernetes-based stateful session affinity tests, where the mesh
(including SSA configuration) is configured using CRDs
3. GCP-based stateful session affinity tests, where the mesh is
configured using the networkservices APIs directly
Tests 1 and 2 will run in both prod and GKE staging, i.e.
`container.googleapis.com` and
`staging-container.sandbox.googleapis.com`. The latter of these will act
as an early detection mechanism for regressions in the controller that
translates Gateway resources into networkservices resources.
Test 3 will run against `staging-networkservices.sandbox.googleapis.com`
to act as an early detection mechanism for regressions in the control
plane SSA implementation.
The scope of the SSA tests is still fairly minimal. Session drain
testing is in-progress but not included in this PR, though several
elements required for it are (grace period, pre-stop hook, and the
ability to kill a single pod in a deployment).
---------
Co-authored-by: Jung-Yu (Gina) Yeh <ginayeh@google.com>
Co-authored-by: Sergii Tkachenko <sergiitk@google.com>
Add some basic metrics to work serializer, keep them process wide for
now (though it may be interesting to get these into channelz in the
future).
Collected are:
- time spent running a work serializer when it starts
- time spent actually executing work when the work serializer runs
- number of items executed each run
A high disparity between the first two indicates our dispatching
mechanism is adding large amounts of latency (perhaps due to thread
starvation like effects).
A high value for any of these indicate contention on the serializer.
It's likely a future iteration on these will select different metrics -
I'm not *entirely* sure which will be useful in production analysis yet.
I'm using `std::chrono::steady_clock` here for precision (nanoseconds)
with a compact representation (better than timespec) and a robust &
portable api - I think it's appropriate for metrics, but wouldn't use it
much beyond that at this point.
The one in xds_override_host was the one that was actually triggering
test failures, but I audited all of the other policies and fixed a
couple of other places that could also be problematic.
This test assumed synchronous work serializer execution (or at least
faster async than we always get)... make a trivial change to keep the
test semantics but allow for the implementation to be more async.
This reverts commit 2db446aa9a.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Original PR was #34307, reverted in #34318 due to internal test
failures.
The first commit is a revert of the revert. The second commit contains
the fix.
The original idea here was that `SubchannelWrapper::Orphan()`, which is
called when the strong refcount reaches 0, would take a new weak ref and
then hop into the `WorkSerializer` before dropping that weak ref, thus
ensuring that the `SubchannelWrapper` is destroyed inside the
`WorkSerializer` (which is needed because the `SubchannelWrapper` dtor
cleans up some state in the channel related to the subchannel). The
problem is that `DualRefCounted<>::Unref()` itself actually increments
the weak ref count before calling `Orphan()` and then decrements it
afterwards. So in the case where the `SubchannelWrapper` is unreffed
outside of the `WorkSerializer` and no other thread happens to be
holding the `WorkSerializer`, the weak ref that we were taking in
`Orphan()` was unreffed inline, which meant that it wasn't actually the
last weak ref -- the last weak ref was the one taken by
`DualRefCounted<>::Unref()`, and it wasn't released until after the
`WorkSerializer` was released.
To this this problem, we move the code from the `SubchannelWrapper` dtor
that cleans up the channel's state into the `WorkSerializer` callback
that is scheduled in `Orphan()`. Thus, regardless of whether or not the
last weak ref is released inside of the `WorkSerializer`, we are
definitely doing that cleanup inside the `WorkSerializer`, which is what
we actually care about.
Also adds an experiment to guard this behavior.
This has been stable for a bit, everywhere that the EventEngine is
enabled. Going forward, I think the event_engine_{client|listener}
experiments can probably be used to regulate thread-pool-specific
issues.
---------
Co-authored-by: drfloob <drfloob@users.noreply.github.com>
I've added channel args to `CreateNewServerCallTracer` on the
`ServerCallTracerFactory`.
The motivation is for CSM Observability where the OTel plugin will be
configured to only do stats on servers which are xDS enabled, so I plan
to check this via channel args.
In the future, with the new scopes for metrics, I think I'll be able to
change this to only check once per server or server connection instead
of per call.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Move the SSL_CTX to the level of the credentials rather than the
subchannel.
The SSL_CTX should only get created once per credential rather than once
per subchannel.
We should observe no behavior change with this PR, only efficiency
gains.
1. Switch to CMake 1.18.
2. Make ergonomic change to push_testing_images.sh to allow building
just a single image.
3. Update packages to reduce a number of vulnerabilities reported.
Most recent attempt was #34320, reverted in #34335.
The first commit here is a pure revert. The second commit fixes the
outlier_detection unit test to pass both with and without the
experiment.
This is a follow up to https://github.com/grpc/grpc/pull/34103
That pull request explicitly aimed to introduce shared library builds
for Windows (DLLs) while effecting zero material change to the existing
build pipelines. That aspiration meant that the grpc++_unsecure library
had to be effectively excluded from the build (because including it
would have also included a dependency on openssl, which makes no sense
given its purpose)
This PR addresses that by:
* Extracting the single function in grpc_tls_certificate_provider with a
dependency on openssl into a separate compilation unit
* Including that new .cc file into the grpc library
* Including grpc_tls_certificate_provider and one other source file into
grpc_unsecure for the Windows DLL build only.
* Reinstating the grpc++_unsecure library which is a prerequisite for
many tests.
* Regenerating all files affected by the changes in Bazel BUILD that
introduce the new source file.
This change does affect the operation of other build pipelines - I have
confirmed that it does not break the Linux Bazel build.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
To fix the following build error with the head of abseil
```
/var/local/git/grpc/test/core/tsi/ssl_transport_security_utils_test.cc:231:42: error: no member named 'StrCat' in namespace 'absl'
return absl::InternalError(absl::StrCat("Client error:", client_err));
~~~~~~^
/var/local/git/grpc/test/core/tsi/ssl_transport_security_utils_test.cc:238:42: error: no member named 'StrCat' in namespace 'absl'
return absl::InternalError(absl::StrCat("Server error:", server_err));
~~~~~~^
```
Distribtests is failing with the following error:
```
Collecting twine<=2.0
Downloading twine-2.0.0-py3-none-any.whl (34 kB)
Collecting pkginfo>=1.4.2 (from twine<=2.0)
Downloading pkginfo-1.9.6-py3-none-any.whl (30 kB)
Collecting readme-renderer>=21.0 (from twine<=2.0)
Obtaining dependency information for readme-renderer>=21.0 from 992e0e21b36c98bc06a55e514cb323/readme_renderer-42.0-py3-none-any.whl.metadata
Downloading readme_renderer-42.0-py3-none-any.whl.metadata (2.8 kB)
Collecting requests>=2.20 (from twine<=2.0)
Obtaining dependency information for requests>=2.20 from 0e2d847013cd6965bf26b47bc0bf44/requests-2.31.0-py3-none-any.whl.metadata
Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting requests-toolbelt!=0.9.0,>=0.8.0 (from twine<=2.0)
Downloading requests_toolbelt-1.0.0-py2.py3-none-any.whl (54 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.5/54.5 kB 5.1 MB/s eta 0:00:00
Requirement already satisfied: setuptools>=0.7.0 in ./venv/lib/python3.9/site-packages (from twine<=2.0) (68.2.0)
Collecting tqdm>=4.14 (from twine<=2.0)
Obtaining dependency information for tqdm>=4.14 from f12a80907dc3ae54c5e962cc83037e/tqdm-4.66.1-py3-none-any.whl.metadata
Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 7.6 MB/s eta 0:00:00
Collecting nh3>=0.2.14 (from readme-renderer>=21.0->twine<=2.0)
Downloading nh3-0.2.14.tar.gz (14 kB)
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'done'
Preparing metadata (pyproject.toml): started
Preparing metadata (pyproject.toml): finished with status 'error'
error: subprocess-exited-with-error
× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [6 lines of output]
Cargo, the Rust package manager, is not installed or is not on PATH.
This package requires Rust and Cargo to compile extensions. Install it through
the system's package manager or via https://rustup.rs/
Checking for Rust toolchain....
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
```
### Why
* We're pulling readme_renderer 42.0 from twine, since 42.0 requires nh3
and nh3 requires Rust, the test is failing.
### Fix
* Pinged readme_renderer to `<40.0` since any version higher or equal to
40.0 requires Python 3.8.
### Testing
* Passed manual run: http://sponge/57d815a7-629f-455f-b710-5b80369206cd
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Reverts grpc/grpc#34325
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->