I've noticed we add the cleanup hook after setting up the
infrastructure, so if the infra setup fails, the cleanup never runs.
This fixes it, and adds extra checks so `cls.test_client_runner` isn't
called when it's not set.
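A minimal sketch of the idea, assuming `absltest` (the class and the
`_make_client_runner` helper are illustrative, not the framework's
actual names):
```py
from absl.testing import absltest

class IsolatedTestCaseSketch(absltest.TestCase):

    @classmethod
    def setUpClass(cls):
        # Register the hook *before* provisioning, so cleanup runs
        # even if the infra setup below raises.
        cls.addClassCleanup(cls._cleanup)
        cls.test_client_runner = cls._make_client_runner()  # may raise

    @classmethod
    def _cleanup(cls):
        # Guard: setup may have failed before the runner was assigned.
        runner = getattr(cls, 'test_client_runner', None)
        if runner:
            runner.cleanup(force=True)
```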
Fail test if client or server pods restarted during test.
#### Testing
Tested locally; the test fails with a message similar to:
```
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/google/home/xuanwn/workspace/xds/grpc/tools/run_tests/xds_k8s_test_driver/framework/xds_k8s_testcase.py", line 501, in tearDown
))
AssertionError: 5 != 0 : Server pods unexpectedly restarted {sever_restarts} times during test.
----------------------------------------------------------------------
Ran 1 test in 886.867s
```
Better logging for `assertRpcStatusCodes`.
(got tired of looking up the status names)
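For reference, a tiny sketch of how a numeric code can be rendered with
its name, assuming the `grpc` package (`format_status` is a
hypothetical helper):
```py
import grpc

def format_status(code: int) -> str:
    """Render a numeric status as '(15, DATA_LOSS)'."""
    # grpc.StatusCode values are (int, str) tuples.
    name = next(
        (status.name for status in grpc.StatusCode
         if status.value[0] == code), 'UNKNOWN')
    return f'({code}, {name})'

# format_status(15) -> '(15, DATA_LOSS)'; format_status(0) -> '(0, OK)'
```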
#### Unexpected status found
Before:
```
AssertionError: AssertionError: Expected only status 15 but found status 0 for method UNARY_CALL:
stats_per_method {
key: "UNARY_CALL"
value {
result {
key: 0
value: 251
}
}
}
```
After:
```
AssertionError: Expected only status (15, DATA_LOSS), but found status (0, OK) for method UNARY_CALL:
stats_per_method {
key: "UNARY_CALL"
value {
result {
key: 0
value: 251
}
}
}
```
#### No traffic with expected status
Before:
```
AssertionError: 0 not greater than 0
```
After:
```
AssertionError: 0 not greater than 0 : Expected non-zero RPCs with status (15, DATA_LOSS) for method UNARY_CALL, got:
stats_per_method {
key: "UNARY_CALL"
value {
result {
key: 0
value: 251
}
result {
key: 15
value: 0
}
}
}
```
Before this change, the `Found subchannel in state READY` and `Channel to
xds:///psm-grpc-server:61404 transitioned to state ` log messages would
dump the full channel/subchannel; in implementations that expose
ChannelData.trace (e.g. Go), this added ~300 extra lines of log.
Now we print a brief repr-like channel/subchannel info:
```
Found subchannel in state READY: <Subchannel subchannel_id=9 target=10.110.1.44:8080 state=READY>
Channel to xds:///psm-grpc-server:61404 transitioned to state READY: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=READY>
```
Also, while waiting for the channel, we now log the channel_id too:
```
Waiting to report a READY channel to xds:///psm-grpc-server:61404
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=READY>
```
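A minimal sketch of producing such a brief repr, assuming the channelz
protos from `grpc_channelz` (the helper name is illustrative):
```py
from grpc_channelz.v1 import channelz_pb2

def channel_brief(channel: channelz_pb2.Channel) -> str:
    """Brief repr-like summary instead of dumping the full proto."""
    state = channelz_pb2.ChannelConnectivityState.State.Name(
        channel.data.state.state)
    return (f'<Channel channel_id={channel.ref.channel_id} '
            f'target={channel.data.target} state={state}>')
```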
Similar to what we already do in other test suites:
- Try cleaning up resources three times.
- If unsuccessful, don't fail the test and just log the error. The
cleanup script should be the one to deal with this.
ref b/282081851
Resolve `TESTING_VERSION` to `dev-VERSION` when the job is initiated by
a user, and not by the CI. Override this behavior by setting
`FORCE_TESTING_VERSION`.
This solves the problem of manual job runs executed against a WIP
branch (e.g. a PR) overriding the tag of the CI-built image we use for
daily testing.
The `dev` and `dev-VERSION` "magic" values supported by the
`--testing_version` flag:
- `dev` and `dev-master` are treated as `master`: all
`config.version_gte` checks resolve to `True`.
- `dev-VERSION` is treated as `VERSION`: `dev-v1.55.x` is treated as
simply `v1.55.x`. We do this so that when manually running jobs for old
branches, the feature skip check still works and unsupported tests are
skipped. See the sketch of the resolution logic below.
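A sketch of the resolution described above (illustrative only, not the
driver's exact code):
```py
def resolve_testing_version(version: str) -> str:
    """Resolve the dev- 'magic' values of --testing_version."""
    if version in ('dev', 'dev-master'):
        return 'master'  # all config.version_gte checks resolve to True
    if version.startswith('dev-'):
        return version[len('dev-'):]  # 'dev-v1.55.x' -> 'v1.55.x'
    return version
```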
This change will take care of all langs/branches, no backports needed.
ref b/256845629
Previously, the error message didn't provide much context. Example:
```py
Traceback (most recent call last):
File "/tmpfs/tmp/tmp.BqlenMyXyk/grpc/tools/run_tests/xds_k8s_test_driver/tests/affinity_test.py", line 127, in test_affinity
self.assertLen(
AssertionError: [] has length of 0, expected 1.
```
ref b/279990584.
---------
Co-authored-by: Sergii Tkachenko <hi@sergii.org>
- Fix broken `bin/run_channelz.py` helper
- Create `bin/run_ping_pong.py` helper that runs the baseline (aka
"ping_pong") test against preconfigured infra
- Set up automatic port forwarding when running `bin/run_channelz.py`
and `bin/run_ping_pong.py`
- Create `bin/cleanup_cluster.sh` helper to wipe out xds resources based
on the namespaces present on the cluster
Note: this involves a small change to the non-helper code, but it's just
moving the part that creates XdsTestServer/XdsTestClient instances for a
given pod.
While a proper fix is on the way, this reduces the number of
duplicated container logs in the xds test server/client pod logs.
The issue is that we only wait between stream restarts when an exception
is caught, which isn't always the reason the stream gets broken. Another
reason is the main container being shut down by k8s. In this situation,
we essentially do
```py
while True:
    try:
        restart_stream()
        read_all_logs_from_pod_start()
    except Exception:
        logger.warning('error')
        wait_seconds(1)  # only waits when an exception was caught
```
This PR makes it
```py
while True:
    try:
        restart_stream()
        read_all_logs_from_pod_start()
    except Exception:
        logger.warning('error')
    finally:
        wait_seconds(5)  # always waits between stream restarts
```
`tearDownClass` is not executed when `setUpClass` fails. In the URL Map
test suite, this leads to a test client that failed to start not being
cleaned up.
This PR changes the URL Map test suite to register a custom
`addClassCleanup` callback instead of relying on `tearDownClass`.
Unlike `tearDownClass`, cleanup callbacks are executed even when
`setUpClass` fails.
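A self-contained illustration of the `unittest` semantics this relies
on:
```py
import unittest

class Demo(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Registered cleanups run even when setUpClass raises below.
        cls.addClassCleanup(lambda: print('class cleanup runs'))
        raise RuntimeError('setUpClass failed')

    @classmethod
    def tearDownClass(cls):
        print('tearDownClass is skipped')  # never reached

    def test_noop(self):
        pass

# unittest.main() reports the setUpClass error, prints
# 'class cleanup runs', and never calls tearDownClass.
```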
ref b/276761453
Previously, we didn't configure the failureThreshold, so it used its
default value. The final `startupProbe` looked like this:
```json
{
  "startupProbe": {
    "failureThreshold": 3,
    "periodSeconds": 3,
    "successThreshold": 1,
    "tcpSocket": {
      "port": 8081
    },
    "timeoutSeconds": 1
  }
}
```
Because of this, the total time before k8s killed the container was
`failureThreshold` (3 probes) * `periodSeconds` (3 seconds between
probes) = 9 seconds total (±3 seconds waiting for the probe response).
This greatly affected the PSM Security test server, some implementations
of which wait for the ADS stream to be configured before they start
listening on the maintenance port. This led to the server container
being killed ~7 times before a successful startup:
```
15:55:08.875586 "Killing container with a grace period"
15:53:38.875812 "Killing container with a grace period"
15:52:47.875752 "Killing container with a grace period"
15:52:38.874696 "Killing container with a grace period"
15:52:14.874491 "Killing container with a grace period"
15:52:05.875400 "Killing container with a grace period"
15:51:56.876138 "Killing container with a grace period"
```
These extra delays led to PSM security tests timing out.
ref b/277336725
This was missed in https://github.com/grpc/grpc/pull/32596, resulting
in k8s teardown exiting early with:
```
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: OMITTED
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.apps \"psm-grpc-client\" not found","reason":"NotFound","details":{"name":"psm-grpc-client","group":"apps","kind":"deployments"},"code":404}
```
With this change, teardown resumes with:
> `k8s_base_runner.py:282] Deployment psm-grpc-client deletion failed:
Kubernetes API returned 404 Not Found: deployments.apps
"psm-grpc-client" not found`
- Increase the kubernetes library default for urllib3 retries to 10
- Add custom retry logic to all API calls made by framework.k8s
The custom retry logic handles various errors we've experienced over
two years, based on ~140 failure reports:
1. Errors returned by the k8s API server itself:
- 401 Unauthorized
- 409 Conflict
- 429 Too Many Requests
- 500 Internal Server Error
2. Connection errors that might indicate the k8s API server is
temporarily unavailable (such as a restart, upgrade, etc.):
   - All `NewConnectionError`s, e.g. "Connection timed out",
     "Connection refused"
   - All "connection aborted" `ProtocolError`s, e.g. "Remote end
     closed connection without response", "Connection reset by peer"
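A sketch of how such errors could be classified as retryable
(hypothetical helper; the framework's actual logic covers more cases):
```py
import urllib3.exceptions
from kubernetes.client.exceptions import ApiException

_RETRYABLE_STATUSES = frozenset([401, 409, 429, 500])

def is_retryable(error: Exception) -> bool:
    """Illustrative classification of k8s API errors worth retrying."""
    if isinstance(error, ApiException):
        # Errors returned by the k8s API server itself.
        return error.status in _RETRYABLE_STATUSES
    if isinstance(error, urllib3.exceptions.NewConnectionError):
        return True  # "Connection timed out", "Connection refused", ...
    if isinstance(error, urllib3.exceptions.ProtocolError):
        # "Remote end closed connection without response",
        # "Connection reset by peer", ...
        return 'Connection aborted' in str(error)
    return False
```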
ref b/178378578, b/258546394
PSM Interop: Local dev various improvements
- Cleanup resources on ctrl+c
- Add startup probes to address the issue with port forwarding starting
before the workload listens on a port
- Remove misleading restartPolicy: it's silently ignored by k8s
- Extra debug message with port-forwarding command
Our test clusters are in a broken/dirty state (except the URL map's
"basic" cluster, which isn't affected).
This PR switches to the newly created clusters to:
1. Get a data point on whether newly created clusters are affected
by the same issues.
2. Allow for descriptive work on the old clusters.
3. Hopefully, bring our tests back to green.
Bonus: more sensible cluster names.
This PR adds retries on create/get requests from the test driver to the K8s API when a 401 Unauthorized error is encountered.
The K8s Python library expects the ApiClient to be cycled on auth token refreshes.
The problem is described in kubernetes-client/python#741. Currently we don't have any hypothesis as to why we weren't affected by this problem before.
To force the ApiClient to pick up the new credentials, I shut down the current client, create a new one, and replace the api_client property on all k8s APIs we manage.
This should also work with the Watch-based log collector recovering from an error. To support that, I replace the default Configuration, so that the next time Watch creates an ApiClient implicitly, the Configuration with the updated token is used.
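A sketch of the cycling approach, assuming the `kubernetes` Python
client (the helper name and the list of managed APIs are illustrative):
```py
from kubernetes import client

def cycle_api_client(managed_apis) -> client.ApiClient:
    """Recreate the ApiClient so refreshed credentials take effect."""
    # A fresh Configuration; the real framework re-runs its auth flow
    # (e.g. reloading the token) before this point.
    new_config = client.Configuration()
    # Make it the default, so Watch picks it up when it implicitly
    # creates its own ApiClient.
    client.Configuration.set_default(new_config)
    new_api_client = client.ApiClient(configuration=new_config)
    for api in managed_apis:  # e.g. CoreV1Api, AppsV1Api instances
        old_api_client, api.api_client = api.api_client, new_api_client
        old_api_client.close()  # shut down the stale connection pools
    return new_api_client
```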
b/228743575 started happening more frequently, and there are more
important fires to worry about. Silence this off-by-one flake so we can
come back to it when we have a bit more time.
Fun edge case: when `rand_string()` happens to generate only digits,
yaml interprets the `deployment_id` label value as an integer,
but k8s expects label values to be strings.
K8s responds with a barely readable 400 Bad Request error:
`ReadString: expects \" or n, but found 9, error found in #10 byte of ...|ent_id`.
Prepending the deployment name forces deployment_id to be a string,
and it's just a better description anyway.
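A quick illustration of the YAML behavior, assuming PyYAML:
```py
import yaml

# An all-digit value is parsed as an integer, which k8s rejects.
print(yaml.safe_load('deployment_id: 20220820'))
# {'deployment_id': 20220820}  <- int

# Prepending the deployment name forces a string.
print(yaml.safe_load('deployment_id: psm-grpc-server-20220820'))
# {'deployment_id': 'psm-grpc-server-20220820'}  <- str
```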
When we use retryers with `log_level=logging.INFO`, tenacity logs the result value (or an exception) after each unsuccessful retry attempt.
We often retry methods that return objects, resulting in unreadable log messages:
```
I0820 03:16:29.027635 140613877811008 before_sleep.py:45] Retrying framework.xds_k8s_testcase.IsolatedXdsKubernetesTestCase.cleanup in 10.0 seconds as it raised RetryError: RetryError[Attempts: 21, Value: {'api_version': 'v1',
'kind': 'Namespace',
'metadata': {'annotations': None,
'cluster_name': None,
'creation_timestamp': datetime.datetime(2022, 8, 20, 2, 55, 32, tzinfo=tzlocal()),
'deletion_grace_period_seconds': None,
'deletion_timestamp': datetime.datetime(2022, 8, 20, 3, 6, 27, tzinfo=tzlocal()),
'finalizers': None,
'generate_name': None,
'generation': None,
'labels': {'kubernetes.io/metadata.name': 'psm-interop-server-20220820-0253-yrmam',
'name': 'psm-interop-server-20220820-0253-yrmam',
'owner': 'xds-k8s-interop-test'},
'managed_fields': [{'api_version': 'v1',
'fields_type': 'FieldsV1',
'fields_v1': {'f:metadata': {'f:labels': {'.': {},
'f:kubernetes.io/metadata.name': {},
... (82 more lines)
```
This PR introduces a custom `before_sleep` logger that only logs the value if it's a primitive (`int`, `str`, `bool`).
Otherwise, it logs the type. Example:
```
k8s_base_runner.py:311] Waiting for pod psm-grpc-client-5d5648478f-7vsf7 to start
retryers.py:192] Retrying framework.infrastructure.k8s.KubernetesNamespace.get_pod in 1.0 seconds as it returned type <class 'kubernetes.client.models.v1_pod.V1Pod'>.
retryers.py:192] Retrying framework.infrastructure.k8s.KubernetesNamespace.get_pod in 1.0 seconds as it returned type <class 'kubernetes.client.models.v1_pod.V1Pod'>.
```
Note that this only changes the behavior of unsuccessful retries, and doesn't affect the new feature that prints the formatted k8s status field when the *final* retry attempt fails.
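A sketch of such a `before_sleep` callback, assuming tenacity's
`RetryCallState` API (the factory name is illustrative):
```py
import logging
import tenacity

def before_sleep_log_brief(logger: logging.Logger, log_level: int):
    """Like tenacity.before_sleep_log, but avoids dumping objects."""
    def log_attempt(retry_state: tenacity.RetryCallState) -> None:
        outcome = retry_state.outcome
        if outcome.failed:
            verb, value = 'raised', repr(outcome.exception())
        elif isinstance(outcome.result(), (int, str, bool)):
            verb, value = 'returned', repr(outcome.result())
        else:
            # Objects produce walls of text; log only the type.
            verb, value = 'returned type', str(type(outcome.result()))
        logger.log(log_level, 'Retrying %s in %s seconds as it %s %s.',
                   retry_state.fn, retry_state.next_action.sleep,
                   verb, value)
    return log_attempt
```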
- Added support for pod log collection. To enable, set the `--collect_app_logs` flag and specify `--log_dir`.
- Added support and helpers for operating on the `--log_dir` (natively provided by absl)
- Added support for `--follow` to `bin/run_test_server.py` and `bin/run_test_client.py` to follow pod logs printed to stdout
- Moved `PortForwarder` from k8s.py to its own file
The collection itself will be enabled per-suite in https://github.com/grpc/grpc/pull/30735.
* xDS interop: Fix default resource prefix
The default resource prefix is no longer specific to the security tests.
This is done to avoid confusion when debugging resources managed
by the LB tests.
* s/xds/psm
All alternative server runners except the failover test's reuse the primary server runner's namespace. The failover test uses the secondary cluster and manages its own namespace there. `reuse_namespace` disables namespace cleanup, and in this case it was incorrectly set to `True`.
- Changes the order of waiting for pods to start: wait for the pods first, then for the deployment to transition to active. This should provide more useful information in the logs, showing exactly why the pod didn't start, instead of the generic "Replicas not available", ref b/200293121. This is also needed for https://github.com/grpc/grpc/pull/30594
- Add support for a `check_result` callback in the retryer helpers (see the sketch after this list)
- Completely replaces `retrying` with `tenacity`, ref b/200293121. `retrying` is no longer maintained.
- Improves the readability of timeout errors: now they contain the timeout (or the attempt count) exceeded, and information on why the retry failed (exception/check function):
Before:
> `tenacity.RetryError: RetryError[<Future at 0x7f8ce156bc18 state=finished returned dict>]`
After:
> `framework.helpers.retryers.RetryError: Retry error calling framework.infrastructure.k8s.KubernetesNamespace.get_pod: timeout 0:01:00 exceeded. Check result callback returned False.`
- Improves the readability of k8s wait operation errors: now the log includes the colorized and formatted status of the k8s object being watched, instead of a dump of the full k8s object, for example when the error is caused by an incorrect TD bootstrap image.
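A sketch of the `check_result`-style retryer mentioned above, built on
tenacity (the helper and its names are illustrative):
```py
import tenacity

def retry_until(check_result, timeout_sec: int = 60, wait_sec: int = 1):
    """Retry a callable until check_result approves its return value."""
    return tenacity.Retrying(
        retry=tenacity.retry_if_result(
            lambda result: not check_result(result)),
        stop=tenacity.stop_after_delay(timeout_sec),
        wait=tenacity.wait_fixed(wait_sec))

# Usage: wait until the pod reports phase Running; raises
# tenacity.RetryError if the timeout is exceeded.
# retryer = retry_until(lambda pod: pod.status.phase == 'Running')
# pod = retryer(k8s_namespace.get_pod, pod_name)
```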
* Enable outlier detection k8s interop test for Java. (#30641)
* xDS interop: enable outlier detection Java tests in >= 1.49.x
Co-authored-by: Terry Wilson <terrymwilson@gmail.com>
pod_name shouldn't be a part of the test app; it's purely a k8s idiom.
Originally, server_id was intended for this purpose, but it was missed
when support for multiple server replicas was added.
This replaces pod_name and server_id with hostname and improves
replica-specific log messages, so it's clear which server
the RPCs are issued to.
In addition, all RPC logs are now annotated with the hostname:port,
so the destination is clear.
Before:
```
server_app.py:76] Setting health status to serving
grpc.py:60] RPC XdsUpdateHealthService.SetServing(request=Empty({}), timeout=90, wait_for_ready=True)
grpc.py:60] RPC Health.Check(request=HealthCheckRequest({}), timeout=90, wait_for_ready=True)
server_app.py:78] Server reports status: SERVING
```
After:
```
server_app.py:89] [psm-grpc-server-69bcf749c5-bg4x5] Setting health status to NOT_SERVING
grpc.py:72] [psm-grpc-server-69bcf749c5-bg4x5:52902] RPC XdsUpdateHealthService.SetNotServing(request=Empty({}), timeout=90, wait_for_ready=True)
grpc.py:72] [psm-grpc-server-69bcf749c5-bg4x5:52902] RPC Health.Check(request=HealthCheckRequest({}), timeout=90, wait_for_ready=True)
server_app.py:92] [psm-grpc-server-69bcf749c5-bg4x5] Health status status: NOT_SERVING
```
Similarly, this adds hostname to the client app, mainly for logging.
In Python tests that require the set_not_serving server RPC, override
the Python server with the reference server (Java), because
the Python server doesn't yet support the set_not_serving RPC.
Ref https://github.com/grpc/grpc/issues/30635.