Ditch the old priority scheme for ordering filters, instead explicitly
mark up before/after constraints.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
The basic APIs for the CRL Reloading features.
This adds external types to represent CRL Providers, CRLs, and
CertificateInfo.
Internally we will use `CrlImpl` - this layer is needed to hide OpenSSL
details from the user.
GRFC - https://github.com/grpc/proposal/pull/382
Things Done
* Add external API for `CrlProvider`, `Crl`, `CertInfo` (`CertInfo` is
used during CRL lookup rather than passing the entire certificate).
* Add code paths in `ssl_transport_security` to utilize CRL providers
* Add `StaticCrlProvider`
* Refactor `crl_ssl_transport_security_test.cc` so it is more extensible
and can be used with providers
1. Changes the resource retention period to 2 days for all resources
(previously 7 days for TD resources, 6 hours for k8s). This solved a
problem with k8s resources being stuck because corresponding TD
resources weren't deleted.
2. Resume on namespace cleanup failures
3. Add secondary lb cluster cleanup logic
4. Modularize `grpc_xds_resource_cleanup.sh`
5. Make `KubernetesNamespace`'s methods `pretty_format_status` and
`pretty_format_metadata` public
6. `pretty_format_status`: also print resource kind, creation and
deletion requested dates
ref b/259724370, cl/517235715
We're seeing too many debug headers, change it to only log header in
case of error.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Instead of fixing a target size for writes, try to adapt it a little to
observed bandwidth.
The initial algorithm tries to get large writes within 100-1000ms
maximum delay - this range probably wants to be tuned, but let's see.
The hope here is that on slow connections we can not back buffer so much
and so when we need to send a ping-ack it's possible without great
delay.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
We're not running any test at all from `run_test.py` because of the way
we filter test cases:
1d136fd05f/src/python/grpcio_tests/tests/_runner.py (L137)
* `testcase_filter` is read from a json file (like [this
one](https://github.com/grpc/grpc/blob/master/src/python/grpcio_tests/tests/tests.json))
and test name is similar to `unit._metadata_test.MetadataTest`.
* `case.id()` is loaded by `iterate_suite_cases` and will always have a
prefix of `tests`, an example of case id will be:
`tests.unit._metadata_test.MetadataTest`.
Because of the prefix, none of the test case will be matched thus we're
not running any of the tests.
This PR fixes the prefix issue and all the regressions comes from not
running tests using `run_test.py`.
#### Other Changes
* Added couple of `__init__.py` file since it's required to load tests.
* Added `py_status_code` to Aio rpc state.
* `code()` is expecting to return a python gRPC code but current
`status_code` is a Cython code.
* Added `libsqlite3-dev` to our dockers because it's required for
`coverage==7.2.0`.
* Renamed csds and admin test because test case file have to end with
`_test`:
1d136fd05f/src/python/grpcio_tests/tests/_loader.py (L26)
* Removed gevent test from `run_test.py` because Bazel gevent tests
should be good enough for us.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
The `TickForDuration()` method was using `grpc_core::Timestamp::Now()`
to get the current time, but that was not in sync with the `now_` value
inside the Fuzzing EE itself, with the result that after two subsequent
250ms increments, timers were not being properly fired. I've added a
test that demonstrates this failure without the fix.
Experiment 1: On RST_STREAM: reduce MAX_CONCURRENT_STREAMS for one round
trip.
Experiment 2: If a settings frame is outstanding with a lower
MAX_CONCURRENT_STREAMS than is configured, and we receive a new incoming
stream that would exceed the new cap, randomly reject it.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
This is the initial change of chaotic-good client transport read path,
which is a following PR of the client transport write path at #33876.
There's a pending work of handling endpoint failures in the transport.
It will be added after we have the inter-activity pipe with close
function.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
- Fixes support for the same address being present more than once in the
address list, which was accidentally broken in #34244.
- Change the call attribute to encode the hash as an integer instead of
a string.
We shouldn't just set `termination_grace_period_seconds=600` by default
for all gamma tests extending `GammaXdsKubernetesTestCase`.
This is what's causing the deployment deletion issue:
> `framework.helpers.retryers.RetryError: Retry error calling
framework.xds_k8s_testcase.IsolatedXdsKubernetesTestCase.cleanup: 1
attempts exhausted. Last exception: RetryError: Retry error calling
framework.infrastructure.k8s.KubernetesNamespace.get_deployment: timeout
0:05:00 (h:mm:ss) exceeded. Check result callback returned False.`
We wait for 5 minutes, while the deployment is happily handing for 10.
Then the second cleanup retry kills it - but not before waiting for
another 5 minutes.
I think `self.force = False` may be solving another issue triggered by
the get_deployment retry timeout: because we start over deleting the
resources by name and some of them are deleted from the first attempt we
get 404. And I'm pretty sure we don't do error-handling correctly when
deleting CRD-based resources - which cascades into even more unnecessary
retries.
- GAMMA server runner: increase the wait time for the NEG annotation
from 1 minute to 3.
- Improve the wording around the wait for NEG methods to make it clear
this is the annotation we're waiting for - so it's not confused with
getting the NEG health from the GCP APIs.
ref b/298501683, b/302723651
We originally wanted a common format with internal OWNERS files on the
thought that we'd use them and import them and oh gosh that was the
wrong thing to think.
With upcoming changes to tree management having them here is going to
cause problems, so lets just update the CODEOWNERS files directly.
Isolate ping callback tracking to its own file.
Also takes the opportunity to simplify keepalive code by applying the
ping timeout to all pings.
Adds an experiment to allow multiple pings outstanding too (this was
originally an accidental behavior change of the work, but one that I
think may be useful going forward).
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Waiting for an active channel takes less time in non-gamma test suites
because they only start waiting after already waited for the TD backends
to be created and report healthy.
In GAMMA, these resources are created asynchronously by Kubernetes. To
compensate for this, we double the timeout for GAMMA tests.
This change is to update the TD bootstrap generator for prod tests. This
is part of the TD release process. The new image has already been merged
to staging and tested locally in google3.
cc: @sergiitk PTAL.
Passed manual run:
https://fusion2.corp.google.com/invocations/0e337438-8573-4b54-ba1c-242b1ad58586
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Support Python 3.12.
### Testing
* Passed all Distribution Tests.
* Also tested locally by installing 3.12 artifact.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
The deleted code here was overriding the
[intended](866fc41067/tools/run_tests/run_tests.py (L62))
default test env of `GRPC_VERBOSITY=DEBUG`.
I'm just deleting it because it looks like`GRPC_TRACE=api` is not having
any affect anyways, since it relies on `GRPC_VERBOSITY=DEBUG` which it
happens to be unsetting.
Added a separate distribtests for gRPC C++ DLL build on Windows. This
DLL build is a community support so it should be independently run from
the existing Windows distribtests. Actual DLL test will be added.
Working towards testing against CSM Observability. Added ability to
register a prometheus exporter with our Opentelemetry plugin. This will
allow our metrics to be available at the standard prometheus port
`:9464`.
Since many tests now run reliably as bazelified tests on RBE, we can
remove them from presubmit runs
to speedup testing of PRs.
(for now, these jobs will still run on master, they can be removed from
master as a followup).
- linux/grpc_distribtests_standalone is now fully covered by bazel test
suite
a3b4c797a7/tools/bazelify_tests/test/BUILD (L202),
setting them to `presubmit=False` will stop tests from running on PRs.
- stop running tests from grpc_bazel_distribtest on PR, instead rely on
bazel distribtests running as bazelified tests.
This is just an initial scope of tests. Much of this code was written by
@ginayeh . I just did the final polish/integration step.
There are 3 main tests included:
1. The GAMMA baseline test, including the [actual GAMMA
API](https://gateway-api.sigs.k8s.io/geps/gep-1426/) rather than vendor
extensions.
2. Kubernetes-based stateful session affinity tests, where the mesh
(including SSA configuration) is configured using CRDs
3. GCP-based stateful session affinity tests, where the mesh is
configured using the networkservices APIs directly
Tests 1 and 2 will run in both prod and GKE staging, i.e.
`container.googleapis.com` and
`staging-container.sandbox.googleapis.com`. The latter of these will act
as an early detection mechanism for regressions in the controller that
translates Gateway resources into networkservices resources.
Test 3 will run against `staging-networkservices.sandbox.googleapis.com`
to act as an early detection mechanism for regressions in the control
plane SSA implementation.
The scope of the SSA tests is still fairly minimal. Session drain
testing is in-progress but not included in this PR, though several
elements required for it are (grace period, pre-stop hook, and the
ability to kill a single pod in a deployment).
---------
Co-authored-by: Jung-Yu (Gina) Yeh <ginayeh@google.com>
Co-authored-by: Sergii Tkachenko <sergiitk@google.com>