Fun edge case: when `rand_string()` happens to generate digits only,
YAML interprets the `deployment_id` label value as an integer,
but k8s expects label values to be strings.
K8s responds with a barely readable 400 Bad Request error:
`ReadString: expects \" or n, but found 9, error found in #10 byte of ...|ent_id`.
Prepending the deployment name forces `deployment_id` to be a string,
and it also makes for a better description.
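A minimal sketch of the idea, not the framework's actual code: `make_deployment_id` is a hypothetical helper name, and the deployment name used here is only an example. The point is that the joined value always contains a non-digit character, so it cannot be read as an integer.

```python
def make_deployment_id(deployment_name: str, suffix: str) -> str:
    """Hypothetical helper: prefix the random suffix with the deployment name.

    The "-" separator alone guarantees the result is not all digits.
    """
    return f"{deployment_name}-{suffix}"


# The problematic case: a random suffix that happens to be all digits.
# Written unquoted in a YAML template, it would be parsed as an integer.
suffix = "90210"
deployment_id = make_deployment_id("psm-grpc-server", suffix)

suffix_is_numeric = suffix.isdigit()
deployment_id_is_numeric = deployment_id.isdigit()
```

Quoting the value in the YAML template would be another way to force a string; the prefix has the side benefit of being more descriptive.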
When we use retryers with `log_level=logging.INFO`, tenacity logs the result value (or an exception) after each unsuccessful retry attempt.
We often retry methods that return objects, resulting in unreadable log messages:
```
I0820 03:16:29.027635 140613877811008 before_sleep.py:45] Retrying framework.xds_k8s_testcase.IsolatedXdsKubernetesTestCase.cleanup in 10.0 seconds as it raised RetryError: RetryError[Attempts: 21, Value: {'api_version': 'v1',
'kind': 'Namespace',
'metadata': {'annotations': None,
'cluster_name': None,
'creation_timestamp': datetime.datetime(2022, 8, 20, 2, 55, 32, tzinfo=tzlocal()),
'deletion_grace_period_seconds': None,
'deletion_timestamp': datetime.datetime(2022, 8, 20, 3, 6, 27, tzinfo=tzlocal()),
'finalizers': None,
'generate_name': None,
'generation': None,
'labels': {'kubernetes.io/metadata.name': 'psm-interop-server-20220820-0253-yrmam',
'name': 'psm-interop-server-20220820-0253-yrmam',
'owner': 'xds-k8s-interop-test'},
'managed_fields': [{'api_version': 'v1',
'fields_type': 'FieldsV1',
'fields_v1': {'f:metadata': {'f:labels': {'.': {},
'f:kubernetes.io/metadata.name': {},
... (82 more lines)
```
This PR introduces a custom `before_sleep` logger that only logs the value if it's a primitive: `int`, `str`, or `bool`.
Otherwise, it logs the type. Example:
```
k8s_base_runner.py:311] Waiting for pod psm-grpc-client-5d5648478f-7vsf7 to start
retryers.py:192] Retrying framework.infrastructure.k8s.KubernetesNamespace.get_pod in 1.0 seconds as it returned type <class 'kubernetes.client.models.v1_pod.V1Pod'>.
retryers.py:192] Retrying framework.infrastructure.k8s.KubernetesNamespace.get_pod in 1.0 seconds as it returned type <class 'kubernetes.client.models.v1_pod.V1Pod'>.
```
Note that this only changes the behavior of the unsuccessful retries, and doesn't affect the new feature that prints the formatted k8s status field if the *final* retry attempt failed.
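The value-versus-type decision can be sketched roughly as follows. `describe_result` is a hypothetical name and the exact wording for primitives is an assumption; in a real tenacity `before_sleep` callback the value would come from the `RetryCallState` outcome rather than being passed in directly.

```python
from typing import Any

# Types considered safe to print in full, per the description above.
_PRIMITIVES = (int, str, bool)


def describe_result(value: Any) -> str:
    """Return a short, log-friendly description of a retried call's result."""
    if isinstance(value, _PRIMITIVES):
        # Primitives are short, so log the value itself.
        return f"returned {value!r}"
    # Anything else (k8s objects, dicts, ...) is reduced to its type.
    return f"returned type {type(value)}"
```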
- Added support for pod log collection. To enable, set the `--collect_app_logs` flag and specify `--log_dir`.
- Added support and helpers for operating on the `--log_dir` (natively provided by absl)
- Added support for `--follow` to `bin/run_test_server.py` and `bin/run_test_client.py` to follow pod logs printed to stdout
- Moved `PortForwarder` from k8s.py to its own file
The collection itself will be enabled per-suite in https://github.com/grpc/grpc/pull/30735.
* xDS interop: Fix default resource prefix
No longer just security tests.
This is done to avoid confusion when debugging resources managed
by the LB tests.
* s/xds/psm
All alternative server runners except the one in the failover test reuse the primary server runner's namespace. The failover test uses the secondary cluster and manages its own namespace there. `reuse_namespace` disables namespace cleanup, and in this case it was incorrectly set to `True`.
- Changes the order of waiting for pods to start: wait for the pods first, then for the deployment to transition to active. This should provide more useful information in the logs, showing exactly why the pod didn't start, instead of the generic "Replicas not available", ref b/200293121. This is also needed for https://github.com/grpc/grpc/pull/30594
- Add support for `check_result` callback in the retryer helpers
- Completely replaces `retrying` with `tenacity`, ref b/200293121. `retrying` is no longer maintained.
- Improves the readability of timeout errors: now they contain the timeout (or the attempt number) exceeded, and information why the timeout failed (exception/check function):
Before:
> `tenacity.RetryError: RetryError[<Future at 0x7f8ce156bc18 state=finished returned dict>]`
After:
> `framework.helpers.retryers.RetryError: Retry error calling framework.infrastructure.k8s.KubernetesNamespace.get_pod: timeout 0:01:00 exceeded. Check result callback returned False.`
- Improves the readability of the k8s wait operation errors: now the log includes the colorized and formatted status of the k8s object being watched, instead of dumping the full k8s object. For example, here's how an error caused by using an incorrect TD bootstrap image looks:
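As a rough illustration of the `check_result` and timeout-message bullets above (not of the k8s status output mentioned in the last bullet), here is a stdlib-only sketch. `retry_until` and this `RetryError` class are hypothetical stand-ins; the framework builds on tenacity, which this sketch does not use.

```python
import datetime
import time
from typing import Any, Callable, Optional


class RetryError(Exception):
    """Hypothetical error type carrying a human-readable explanation."""


def retry_until(
    fn: Callable[[], Any],
    *,
    timeout: datetime.timedelta,
    wait: datetime.timedelta = datetime.timedelta(milliseconds=10),
    check_result: Optional[Callable[[Any], bool]] = None,
) -> Any:
    """Call fn until check_result accepts its value or the timeout expires."""
    deadline = time.monotonic() + timeout.total_seconds()
    while True:
        try:
            result = fn()
            if check_result is None or check_result(result):
                return result
            reason = "Check result callback returned False."
        except Exception as error:  # This sketch retries on any exception.
            reason = f"Last exception: {error!r}."
        if time.monotonic() >= deadline:
            name = getattr(fn, "__qualname__", repr(fn))
            # State what was exceeded and why the last attempt was rejected.
            raise RetryError(
                f"Retry error calling {name}: timeout {timeout} exceeded. {reason}"
            )
        time.sleep(wait.total_seconds())
```

Keeping the rejection reason next to the timeout in one message is what makes the "After" error above easier to act on than a bare `Future` repr.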
* Enable outlier detection k8s interop test for Java. (#30641)
* xDS interop: enable outlier detection Java tests in >= 1.49.x
Co-authored-by: Terry Wilson <terrymwilson@gmail.com>
pod_name shouldn't be a part of the test app, it's purely a k8s idiom.
Originally server_id was intended for this purpose, but it was missed
when support for multiple server replicas was added.
This replaces pod_name and server_id with hostname and improves
replica-specific log messages, so it's clear which server
the RPCs are issued to.
In addition, now all RPC logs are annotated with the hostname:port,
so the destination is clear.
Before:
```
server_app.py:76] Setting health status to serving
grpc.py:60] RPC XdsUpdateHealthService.SetServing(request=Empty({}), timeout=90, wait_for_ready=True)
grpc.py:60] RPC Health.Check(request=HealthCheckRequest({}), timeout=90, wait_for_ready=True)
server_app.py:78] Server reports status: SERVING
```
After:
```
server_app.py:89] [psm-grpc-server-69bcf749c5-bg4x5] Setting health status to NOT_SERVING
grpc.py:72] [psm-grpc-server-69bcf749c5-bg4x5:52902] RPC XdsUpdateHealthService.SetNotServing(request=Empty({}), timeout=90, wait_for_ready=True)
grpc.py:72] [psm-grpc-server-69bcf749c5-bg4x5:52902] RPC Health.Check(request=HealthCheckRequest({}), timeout=90, wait_for_ready=True)
server_app.py:92] [psm-grpc-server-69bcf749c5-bg4x5] Health status status: NOT_SERVING
```
Similarly, this adds hostname to the client app, mainly for logging.
In python tests that require set_not_serving server RPC, override
the python server with the reference server (Java) because
the python server doesn't yet support set_not_serving RPC.
Ref https://github.com/grpc/grpc/issues/30635.
This fixes an issue where KubernetesNamespace.list_deployment_pods(),
as well as the deployment itself, would select incorrect pods
when multiple deployments share the same namespace.
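The failure mode can be sketched with plain dictionaries. This is an illustration only: the `app` label and both pod names are hypothetical, and it assumes the fix is to select on a label unique to each deployment, such as a `deployment_id` label.

```python
from typing import Dict, List

Pod = Dict[str, Dict[str, str]]


def select_pods(pods: List[Pod], selector: Dict[str, str]) -> List[Pod]:
    """Equality-based label selection: every selector pair must match."""
    return [
        pod
        for pod in pods
        if all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


# Two deployments in one namespace that share a generic (hypothetical) label.
pods = [
    {"labels": {"app": "psm-grpc-server", "deployment_id": "primary-abc12"}},
    {"labels": {"app": "psm-grpc-server", "deployment_id": "alternate-xyz89"}},
]

# Too broad: matches pods of both deployments.
too_broad = select_pods(pods, {"app": "psm-grpc-server"})
# Unique per deployment: matches only the intended pods.
precise = select_pods(
    pods, {"app": "psm-grpc-server", "deployment_id": "primary-abc12"}
)
```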
Separates xDS Test Client/Server (represent an interface to corresponding workload running remotely) from their runners (kubernetes-specific logic to provision the workloads with prerequisites).
This is a refactoring and should not change the behavior.
Some tests override unittest's `tearDown()`, which is not wrong, but less resilient than overriding the custom `cleanup()`, which is retried in the framework's `tearDown()`.
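A simplified sketch of why overriding `cleanup()` is more resilient, assuming a base class whose `tearDown()` retries it. The class names and the fixed attempt count are hypothetical; the framework's real retry policy is not shown here.

```python
import unittest


class RetryingCleanupTestCase(unittest.TestCase):
    """Hypothetical base class: tearDown() retries the overridable cleanup()."""

    CLEANUP_ATTEMPTS = 3  # Assumed value, for illustration only.

    def cleanup(self) -> None:
        """Subclasses override this instead of tearDown()."""

    def tearDown(self) -> None:
        last_error = None
        for _ in range(self.CLEANUP_ATTEMPTS):
            try:
                self.cleanup()
                return
            except Exception as error:  # Transient failures get another attempt.
                last_error = error
        raise last_error


class FlakyCleanupTest(RetryingCleanupTestCase):
    cleanup_calls = 0

    def cleanup(self) -> None:
        self.cleanup_calls += 1
        if self.cleanup_calls < 3:
            raise RuntimeError("resource is still being deleted")

    def test_nothing(self) -> None:
        pass
```

A test that overrides `tearDown()` directly would bypass the loop and fail on the first transient error.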
- xDS interop: add support for the reference xds test server
- Set default xDS test server reference to Java `v1.48.1`
- Override xDS test server with the reference in Outlier Detection
To improve debugging of the tests with steps that look similar, e.g. failover.
Makes the end of one subtest, and the beginning of the next one much clearer.
Note: URL map test suite does not use subtests, so I didn't add the logging there.
The `kubernetes` library does not provide a way to configure the default socket timeout for the `urllib3` it uses under the hood, and `urllib3`'s default socket timeout is infinite.
This PR sets the default socket timeout using python's `socket.setdefaulttimeout()` to 60 seconds.
This affects `urllib3` directly, and therefore `kubernetes`.
The change is also picked up by the `google-api-python-client`, which does not use `urllib3` (it uses `httplib2`), but [respects](https://googleapis.github.io/google-api-python-client/docs/epy/googleapiclient.http-module.html#build_http) `socket.setdefaulttimeout()`.
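For reference, the mechanism boils down to a single stdlib call. The constant name below is hypothetical; where exactly the framework makes this call is not shown here.

```python
import socket

# Hypothetical constant name; the value matches the 60 seconds described above.
DEFAULT_SOCKET_TIMEOUT_SEC = 60

# Process-wide default: applies to socket objects created after this call.
socket.setdefaulttimeout(DEFAULT_SOCKET_TIMEOUT_SEC)

# A newly created socket inherits the default timeout.
sock = socket.socket()
try:
    inherited_timeout = sock.gettimeout()
finally:
    sock.close()
```

Because the setting is process-wide, it affects every library that creates sockets without an explicit timeout, which is why both HTTP stacks pick it up.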
Add consistent operation id logs for GCP long-running operations - both old-style (compute) and the new APIs.
At the moment it's a bit more verbose than I'd want, e.g. it doubles the number of log messages during the teardown. We should probably only log failed ops. But to do this reliably, we should probably revisit the issue of improving tenacity retry error failure reports.
* Add xDS interop test for outlier detection
This implements the test described in #29623, and plumbing for setting the
outlierDetection field in the backend service config. The changes in this PR
are very similar to #29688.
* Fix use of configure method
* Correct copy/paste error
* Fix metadata configuration syntax
* Increase QPS, use just one method
* Format code
* Apply suggestions from code review
Co-authored-by: Sergii Tkachenko <hi@sergii.org>
* Address review comments
* Only Java implements the required server features
* Automated change: Fix sanity tests
* Address review comments
* Use double quotes for docstring
Co-authored-by: Sergii Tkachenko <hi@sergii.org>
Co-authored-by: murgatroid99 <murgatroid99@users.noreply.github.com>
Resume the failover test. For now, just on master. It will be resumed on other branches once the fix is backported.
At the moment, master is fixed in Java and Go.
ref b/238226704
All tests that use `assertRpcsEventuallyGoToGivenServers` method were
reporting successes when the assertion failed:
- FailoverTest
- ChangeBackendServiceTest
- RemoveNegTest
Added a couple of tests which run the baseline_test with all released
bootstrap generator versions on client and server. These tests will be
run on a continuous integration environment with gRPC servers and
clients built using the latest released version of gRPC in one selected
language.
* Add supported Node version ranges in xDS k8s url_map tests
This adds is_supported implementations for most of the url_map tests that didn't
already have them. The exception is metadata_filter_test because it doesn't use
any specific client features.
* Fix formatting
* Improve timeout test check order
1. Fixes the issue with Java PSM security tests being accidentally skipped because Java was missing from the list of languages, ref https://github.com/grpc/grpc/pull/28978
2. Inverts the logic of `is_supported` methods, making them normally open
3. Makes languages an `enum.Flag` to avoid accidental typos when listing the languages
4. Renames `XdsKubernetesTestCase.isSupported` to `XdsKubernetesTestCase.is_supported` to be consistent with `XdsUrlMapTestCase.is_supported`
5. Adds extra logging
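Items 2 and 3 above can be sketched together. The member list, the standalone `is_supported` signature, and the version rule are illustrative assumptions rather than the framework's definitions.

```python
import enum


class Lang(enum.Flag):
    """Illustrative language flags; members can be combined with `|`."""

    CPP = enum.auto()
    GO = enum.auto()
    JAVA = enum.auto()
    PYTHON = enum.auto()
    NODE = enum.auto()


def is_supported(client_lang: Lang, version: str) -> bool:
    """Normally open: supported unless a known limitation applies."""
    # Hypothetical rule: pretend C++ and Python lack the feature in v1.40.x.
    if client_lang in (Lang.CPP | Lang.PYTHON) and version == "v1.40.x":
        return False
    return True
```

With plain strings, a typo such as `"jaav"` would silently never match, so the language would fall through unnoticed. With an enum, `Lang.JAAV` raises `AttributeError` as soon as the line runs. With the normally-open logic, a language missing from a list runs the test instead of being skipped.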
Split `XdsKubernetesTestCase` into:
- `XdsKubernetesTestCase` top-level base class containing flag parsing logic and common `assert*` methods
- `XdsKubernetesIsolatedTestCase` extending `XdsKubernetesTestCase`, which is specific to tests that want to create infra resources before each test and destroy them after.
Now tests that don't need to create/destroy all resources on each run can extend `XdsKubernetesTestCase` without having to override `setUp` or implement other unnecessary methods.