Our test clusters are in the broken/dirty state (except the URL map's
"basic" cluster, which isn't affected).
This PR switches to the newly created clusters to:
1. Get a data point on whether newly created clusters are affected
by the same issues.
2. Allow for descriptive work on the old clusters.
3. Hopefully, bring our tests back to green.
Bonus: more sensible cluster names.
- Added support for pod log collection. To enable, set `--collect_app_logs` flag, and specify `--log_dir`.
- Added support and helpers for operating on the `--log_dir` (natively provided by absl)
- Added support for `--follow` to `bin/run_test_server.py` and `bin/run_test_client.py` to follow pod logs printed to stdout
- Moved `PortForwarder` from k8s.py to its own file
The collection itself will be enabled per-suite in https://github.com/grpc/grpc/pull/30735.
- Changes the order of waiting for pods to start: wait for the pods first, then for the deployment to transition to active. This should provide more useful information in the logs, showing exactly why the pod didn't start, instead of generic "Replicas not available" ref b/200293121. This also needed for https://github.com/grpc/grpc/pull/30594
- Add support for `check_result` callback in the retryer helpers
- Completely replaces `retrying` with `tenacity`, ref b/200293121. Retrying is not longer maintained.
- Improves the readability of timeout errors: now they contain the timeout (or the attempt number) exceeded, and information why the timeout failed (exception/check function):
Before:
> `tenacity.RetryError: RetryError[<Future at 0x7f8ce156bc18 state=finished returned dict>]`
After:
> `framework.helpers.retryers.RetryError: Retry error calling framework.infrastructure.k8s.KubernetesNamespace.get_pod: timeout 0:01:00 exceeded. Check result callback returned False.`
- Improves the readability of the k8s wait operation errors: now the log includes colorized and formatted status of the k8s object being watched, instead of dumping the full k8s object. For example, here's how an error caused by using incorrect TD bootstrap image:
pod_name shouldn't be a part of the test app, it's purely k8s' idiom.
Originally server_id was intended for this purpose, but it was missed
when support for multiple server replicas added.
This replaces pod_name and server_id with hostname and improves
replica-specific log messages, so it's clear to what server
RPCs are issued.
In addition, now all RPC logs are annotated with the hostname:port,
so the destination is clear.
Before:
```
server_app.py:76] Setting health status to serving
grpc.py:60] RPC XdsUpdateHealthService.SetServing(request=Empty({}), timeout=90, wait_for_ready=True)
grpc.py:60] RPC Health.Check(request=HealthCheckRequest({}), timeout=90, wait_for_ready=True)
server_app.py:78] Server reports status: SERVING
```
After:
```
server_app.py:89] [psm-grpc-server-69bcf749c5-bg4x5] Setting health status to NOT_SERVING
grpc.py:72] [psm-grpc-server-69bcf749c5-bg4x5:52902] RPC XdsUpdateHealthService.SetNotServing(request=Empty({}), timeout=90, wait_for_ready=True)
grpc.py:72] [psm-grpc-server-69bcf749c5-bg4x5:52902] RPC Health.Check(request=HealthCheckRequest({}), timeout=90, wait_for_ready=True)
server_app.py:92] [psm-grpc-server-69bcf749c5-bg4x5] Health status status: NOT_SERVING
```
Similarly, this adds hostname to the client app, mainly for logging.
Separates xDS Test Client/Server (represent an interface to corresponding workload running remotely) from their runners (kubernetes-specific logic to provision the workloads with prerequisites).
This is a refactoring, should not change the behavior.
`kubernetes` library does not provide a way to configure the default socket timeout that will be used with `urllib3` it uses under the hood. And `urllib3` default socket timeout is infinity.
This PR sets the default socket timeout using python's `socket.setdefaulttimeout()` to 60 seconds.
This affects `urllib3` directly, and therefore `kubernetes`.
The changes is also picked up by the `google-api-python-client`, which does not use `urllib3` (it uses `httplib2`), but [respectes](https://googleapis.github.io/google-api-python-client/docs/epy/googleapiclient.http-module.html#build_http) `socket.setdefaulttimeout()`.
* [PSM Interop] Extend clean-up script to 2 other GKE clusters
* Use a safer apprach to invoke the cleanup script
* Handle the readonly issue
* Make sanity test happy
* Revert "Revert "[App Net] Switch Router to Mesh and Add unique string to Scope (#28145)" (#28176)"
This reverts commit cc968b2158.
* Allow scope to be None
* Add back references and scope field
* Set scope in router
* Reverse order of cleanup
* Add router_scope flag
* Use router_scope flag to create Router
* I apparently don't know how to brain
* Yapf
* Yeah, that can't be the default
* Remove debug print
* Remove impossible todos
* And another
* Switch from router-scope to config-scope
* Implement schema changes
* Use backend service URL
* Use CLH reference format to backend service
* I am an idiot
* *internal screaming*
* Try project number
* Why is this all awful
* Go back to trying project name
* Try cleaning things up
* Agh
* Address review comments
* Remove superfluous Optional type